Let $\mathcal{F}$ be a class of measurable functions $f : S \mapsto [0,1]$ defined on
a probability space $(S, \mathcal{A}, P)$. Given a sample $(X_1, \dots, X_n)$ of i.i.d.
random variables taking values in $S$ with common distribution $P$, let
$P_n$ denote the empirical measure based on $(X_1, \dots, X_n)$. We study
an empirical risk minimization problem $P_n f \to \min$, $f \in \mathcal{F}$. Given a
solution $\hat f_n$ of this problem, the goal is to obtain very general upper
bounds on its excess risk

$$\mathcal{E}_P(\hat f_n) := P\hat f_n - \inf_{f \in \mathcal{F}} Pf,$$

expressed in terms of relevant geometric parameters of the class $\mathcal{F}$.
Using concentration inequalities and other empirical processes tools,
we obtain both distribution-dependent and data-dependent upper
bounds on the excess risk that are of asymptotically correct order in
many examples. The bounds involve localized sup-norms of empirical
and Rademacher processes indexed by functions from the class. We
use these bounds to develop model selection techniques in abstract
risk minimization problems that can be applied to more specialized
frameworks of regression and classification.

1. Introduction. Let $(S, \mathcal{A}, P)$ be a probability space and let $\mathcal{F}$ be a class
of measurable functions $f : S \mapsto [0,1]$. Let $(X_1, \dots, X_n)$ be a sample of i.i.d.
random variables defined on a probability space $(\Omega, \Sigma, \mathbb{P})$ and taking values
in $S$ with common distribution $P$. Let $P_n$ denote the empirical measure
based on the sample $(X_1, \dots, X_n)$.

We consider the problem of risk minimization 

(1.1) $Pf \to \min, \quad f \in \mathcal{F},$

under the assumption that the distribution $P$ is unknown and has to be
replaced by its estimate, $P_n$. Thus, the true risk minimization is replaced by
the empirical risk minimization:

(1.2) $P_n f \to \min, \quad f \in \mathcal{F}.$



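Definition. Let

$$\mathcal{E}(f) := \mathcal{E}_P(f) := \mathcal{E}_P(\mathcal{F}; f) := Pf - \inf_{g \in \mathcal{F}} Pg.$$

This quantity will be called the excess risk of $f \in \mathcal{F}$. The set
$\mathcal{F}_P(\delta) := \{f \in \mathcal{F} : \mathcal{E}_P(f) \le \delta\}$ will be called the $\delta$-minimal set of $P$.
In particular, $\mathcal{F}_P(0)$ is the minimal set of $P$.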



Given a solution (or an approximate solution) $\hat f = \hat f_n$ of (1.2), the first
problem of interest is to provide very general upper confidence bounds on
the excess risk $\mathcal{E}_P(\hat f_n)$ of $\hat f_n$ that take into account some relevant geometric
parameters of the class $\mathcal{F}$ as well as some measures of accuracy of
approximation of $P$ by $P_n$ locally in the class. Namely, based on the $L_2(P)$-diameter
$D_P(\mathcal{F};\delta)$ of the $\delta$-minimal set $\mathcal{F}(\delta)$ and the function

$$\phi_n(\mathcal{F};\delta) := \mathbb{E} \sup_{f,g \in \mathcal{F}(\delta)} |(P_n - P)(f - g)|,$$

we construct a quantity $\delta_n(\mathcal{F};t)$ such that inequalities of the following type
hold:

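$$\mathbb{P}\{\mathcal{E}_P(\hat f_n) \ge \delta_n(\mathcal{F};t)\} \le \log_q \frac{q}{\delta_n(\mathcal{F};t)}\, e^{-t}, \qquad t > 0$$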

(see Section 3). The bound $\delta_n(\mathcal{F};t)$ has an asymptotically correct order (with
respect to $n$) in many particular examples of risk minimization problems
occurring in regression, classification and machine learning. However, if the
diameter $D_P(\mathcal{F};\delta)$ does not tend to $0$ as $\delta \to 0$ (which is the case when
the risk minimization problem has multiple solutions), it happens that the
bound $\delta_n(\mathcal{F};t)$ is no longer tight, and one has to redefine it using more subtle
characteristics of geometry of the class than $D_P(\mathcal{F};\delta)$ (see Section 4).

We will now describe a heuristic way to derive such bounds. It is based
on iterative localization of the bound and it can be made precise (see the
remark after the proof of Theorem 2 in Section 9 and also [27] where this
type of argument was introduced in a more specialized setting). Define



$$U_n(\delta;t) := K\left(\phi_n(\mathcal{F};\delta) + D(\mathcal{F};\delta)\sqrt{\frac{t}{n}} + \frac{t}{n}\right).$$



It follows from Talagrand's concentration inequality (see Section 2.1) that
with some constant $K > 0$ for all $t > 0$

$$\mathbb{P}\Big\{ \sup_{f,g \in \mathcal{F}(\delta)} |(P_n - P)(f - g)| \ge U_n(\delta;t) \Big\} \le e^{-t}.$$

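Take $\delta^{(0)} = 1$, so that $\mathcal{F}(\delta^{(0)}) = \mathcal{F}$ (recall that functions in $\mathcal{F}$ take values
in $[0,1]$). Assume, for simplicity, that the minimum of $Pf$ is attained at
$\bar f \in \mathcal{F}$. Since $\hat f, \bar f \in \mathcal{F}(\delta^{(0)})$ and $P_n \hat f \le P_n \bar f$, we have with probability at
least $1 - e^{-t}$

$$\mathcal{E}_P(\hat f) = P\hat f - P\bar f = P_n \hat f - P_n \bar f + (P - P_n)(\hat f - \bar f)$$

$$\le \sup_{f,g \in \mathcal{F}(\delta^{(0)})} |(P_n - P)(f - g)| \le U_n(\delta^{(0)};t) \wedge 1 =: \delta^{(1)}.$$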

This implies that $\hat f, \bar f \in \mathcal{F}(\delta^{(1)})$ and we can repeat the above argument to
show that with probability at least $1 - 2e^{-t}$, $\mathcal{E}_P(\hat f) \le U_n(\delta^{(1)};t) \wedge 1 =: \delta^{(2)}$.
Iterating the argument $N$ times shows that with probability at least $1 - Ne^{-t}$
we have $\mathcal{E}_P(\hat f) \le \delta^{(N)}$, where $\delta^{(N)} := U_n(\delta^{(N-1)};t) \wedge 1$. If the sequence $\delta^{(N)}$
converges to the solution $\bar\delta$ of the fixed point equation $\delta = U_n(\delta;t) \wedge 1$ and if
the convergence is fast enough so that with some $C > 1$ for relatively small
$N$ we have $\delta^{(N)} \le C\bar\delta$, the above argument shows that $\mathcal{E}_P(\hat f) \le C\bar\delta$ with
probability at least $1 - Ne^{-t}$. Both with and without this iterative argument,
we show in Section 3 (and prove in Section 9) that the construction of
good upper bounds on the excess risk of $\hat f$ is related to fixed point-type
equations for the function $U_n(\delta;t)$. The fixed point method has been developed
in recent years in Massart [36], Koltchinskii and Panchenko [27] and Bartlett,
Bousquet and Mendelson [5] (and in several other papers of these authors).
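
The iterative localization described above can be illustrated numerically. The sketch below is only an illustration under stated assumptions, not the construction used in the paper: the forms $\phi_n(\delta) = c_\phi\sqrt{\delta/n}$ and $D(\delta) = c_D\sqrt{\delta}$, the constant $K = 1$, and the names `make_U` and `localize` are all hypothetical choices made so that the fixed point is easy to see.

```python
import math

def make_U(n, t, K=1.0, c_phi=1.0, c_D=1.0):
    """Return the map delta -> U_n(delta; t) for a hypothetical class.

    Assumed for illustration only (not from the text):
    phi_n(delta) = c_phi * sqrt(delta / n) and D(delta) = c_D * sqrt(delta).
    """
    def U(delta):
        phi = c_phi * math.sqrt(delta / n)
        D = c_D * math.sqrt(delta)
        return K * (phi + D * math.sqrt(t / n) + t / n)
    return U

def localize(U, steps):
    """Iterate delta_{k+1} = min(U(delta_k), 1), starting from delta_0 = 1."""
    deltas = [1.0]
    for _ in range(steps):
        deltas.append(min(U(deltas[-1]), 1.0))
    return deltas

U = make_U(n=10_000, t=5.0)
deltas = localize(U, steps=30)
# deltas[-1] approximates the solution of delta = min(U(delta), 1)
```

With these assumed forms the sequence decreases from 1 and stabilizes after a handful of steps, which is the behavior the heuristic argument relies on when it asks that $\delta^{(N)} \le C\bar\delta$ for relatively small $N$.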

The second problem is to develop ratio-type inequalities for the excess 
risk, namely, to bound the following probabilities: 



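$$\mathbb{P}\left\{ \sup_{f \in \mathcal{F},\ \mathcal{E}_P(f) \ge \delta} \left| \frac{\mathcal{E}_{P_n}(f)}{\mathcal{E}_P(f)} - 1 \right| \ge \varepsilon \right\}$$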



(see Section 3). This problem is an important ingredient of the analysis of
empirical risk minimization [in particular, we will use inequalities for such
probabilities in our construction of data-dependent bounds on the excess
risk $\mathcal{E}_P(\hat f)$] and it is related to the study of ratio-type empirical processes
(see [19, 20] for recent results on this subject).

The third problem is to construct data-dependent upper confidence bounds
on $\mathcal{E}_P(\hat f_n)$. To this end, we replace the geometric parameters of the class
[such as $D_P(\mathcal{F};\delta)$] by their empirical versions and the empirical process
involved in the definition of data-dependent bounds by the Rademacher 
process (Section 3). The idea to use sup-norms or localized sup-norms of the 
Rademacher process as bootstrap-type estimates of the size of correspond- 
ing suprema of the empirical process has originated in machine learning 
literature (see [4, 5, 14, 26, 27, 34]). The current paper continues this line 
of research. Very recently, Bartlett and Mendelson [7] developed an interest- 
ing new definition of localized Rademacher complexities and gave a curious 
example in which this complexity provides a sharper bound on the risk of 
empirical risk minimizers than the complexities studied so far. It is not clear 
yet whether the phenomenon they studied occurs in actual machine learning 
or statistical problems. Because of this, we do not pursue this approach in 
the current paper. 

The fourth problem is to develop rather general model selection techniques 
in risk minimization that utilize our data-dependent bounds on the excess 
risk (Sections 5, 6). More precisely, we study a version of structural risk 
minimization in which the class $\mathcal{F}$ is approximated by a family of classes
$\mathcal{F}_k$, $k \ge 1$ (they are often associated with certain models, e.g., in regression or
classification) and the empirical risk minimization problem (1.2) is replaced
by a family of problems

(1.3) $P_n f \to \min, \quad f \in \mathcal{F}_k, \ k \ge 1.$

The goal now is, based on solutions $\hat f_{n,k}$ of problems (1.3) and on the data,
to construct an estimate $\hat k$ of the index $k(P)$ of the "correct" model (i.e., a value
of $k$ such that the solution of risk minimization problem (1.1) belongs to $\mathcal{F}_k$,
or at least is well approximated by this class) and an "adaptive" solution
$\hat f = \hat f_{n,\hat k}$ whose excess risk is close to being "optimal." The optimality of the
solution is typically expressed by so-called oracle inequalities which, very
roughly, show that the excess risk of $\hat f$ is within a constant from the excess
risk of the solution one would have obtained with the help of an "oracle"
who knows precisely to which of the classes $\mathcal{F}_k$ the true risk minimizer
belongs [knows $k(P)$]. This way of thinking has become rather common in
nonparametric statistics literature where various types of oracle inequalities 
have been proved, most often, in specialized settings (see [23] for a discussion 
on the subject). 

The first general theory of empirical risk minimization was systematically
developed by Vapnik and Chervonenkis [49] (see also [48] and references
therein) in the late 1970s and early 1980s (although a number of more special
results had been obtained much earlier, in particular, in connection with
the development of the theory of maximum likelihood and M-estimation).
They obtained a number of bounds on $\mathcal{E}_P(\hat f_n)$ based on the inequality
$\mathcal{E}_P(\hat f_n) \le 2\|P_n - P\|_{\mathcal{F}}$ and on further bounding the sup-norm $\|P_n - P\|_{\mathcal{F}}$
in terms of random entropies or, now famous, VC-dimensions of the class $\mathcal{F}$
[here and in what follows $\|Y\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} |Y(f)|$ for $Y : \mathcal{F} \mapsto \mathbb{R}$]. They also
developed more subtle bounds that provide an improvement in the case of
small (in particular, zero) risk. These results played a significant role in the
development of the general theory of empirical processes (see [16, 47]).

New developments in nonparametric statistics and, especially, in ma- 
chine learning have motivated a number of improvements in the Vapnik- 
Chervonenkis theory of empirical risk minimization. Our approach largely 
relies on well-known papers of Birgé and Massart [8], Barron, Birgé and
Massart [3], and on the more recent paper of Massart [36]. These authors proved
a number of oracle inequalities for regression, density estimation and other 
nonparametric problems. More importantly, they suggested a rather general 
methodology of dealing with model selection for minimum contrast estima- 
tors that is based on Talagrand's concentration and deviation inequalities 
for empirical processes [42, 43], a new probabilistic tool at the time when 
these papers were written. Despite the fact that in many special statistical 
problems the use of Talagrand's inequalities can be avoided and oracle in- 
equalities can be proved relying on more elementary probabilistic methods, 
one could hardly deny that concentration inequalities are the only univer- 
sal tool in probability that suits the needs of model selection and oracle 
inequalities problems extremely well and are, probably, unavoidable when 
these problems are being dealt with in their full generality (e.g., in a ma- 
chine learning setting). Talagrand's inequalities will be the main tool in this 
paper. Another important piece of work is the paper by Shen and Wong [39] 
where empirical processes methods were used to analyze empirical risk min- 
imization on sieves (and, in particular, a version of iterative localization of 
excess risk bounds close to the approach discussed above was developed in 
a more specialized framework). 

One of our main motivations was to understand better the results of Mam- 
men and Tsybakov [35] on fast convergence rates in classification as well as 
more recent results of Tsybakov [44] and Tsybakov and van de Geer [45] on 
adaptation strategies for which these rates are attained. Our goal is to in- 
clude these types of results in a more general framework of abstract empirical 
risk minimization (see Section 6). Another goal is to include into the same 
framework some other recent model selection results, especially in learning 
theory, where there is a definite need to develop general data-driven com- 
plexity penalization techniques suitable for neural networks, kernel machines 
and ensemble methods (see [28, 29, 30]). The analysis of convergence rates 
and the development of adaptive strategies for classification are currently at 
early stages (even consistency of boosting and kernel machines classification
algorithms was established only recently; see [33, 40, 50]). Very recently,
Bartlett, Jordan and McAuliffe [6] and Blanchard, Lugosi and Vayatis [10] 
obtained convergence rates of boosting-type classification methods based on 
convex risk minimization. Blanchard, Bousquet and Massart [9] obtained 
interesting oracle inequalities for penalized empirical risk minimization in 
kernel machines. It is of importance to develop better general ingredients 
of the proofs of such results so that it would be possible to concentrate on 
more specific difficulties related to the nature of the classification problem. 
These types of problems as well as a somewhat more general framework of 
convex risk minimization, including regression problems, are also within the 
scope of the methods of this paper (Sections 7, 8). 

The proofs of all main results in the paper are given in Section 9. 

2. Preliminaries. 

2.1. Talagrand's concentration inequalities. Most of the results of the
paper are based on famous concentration inequalities for empirical processes
due to Talagrand [42, 43] (that provide uniform versions of classical
Bernstein-type inequalities for sums of i.i.d. random variables). We use the
versions of these inequalities proved by Bousquet [13] and Klein [24] (see [11]
for some other relevant inequalities). Namely, for a class $\mathcal{F}$ of measurable
functions from $S$ into $[0,1]$ (by a simple rescaling $[0,1]$ can be replaced by
any bounded interval) the following bounds hold for all $t > 0$:

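• Bousquet's bound:

$$\mathbb{P}\left\{ \|P_n - P\|_{\mathcal{F}} \ge \mathbb{E}\|P_n - P\|_{\mathcal{F}} + \sqrt{2\frac{t}{n}\left(\sigma_P^2(\mathcal{F}) + 2\mathbb{E}\|P_n - P\|_{\mathcal{F}}\right)} + \frac{t}{3n} \right\} \le e^{-t}$$

• Klein's bound:

$$\mathbb{P}\left\{ \|P_n - P\|_{\mathcal{F}} \le \mathbb{E}\|P_n - P\|_{\mathcal{F}} - \sqrt{2\frac{t}{n}\left(\sigma_P^2(\mathcal{F}) + 2\mathbb{E}\|P_n - P\|_{\mathcal{F}}\right)} - \frac{8t}{3n} \right\} \le e^{-t}$$

(we modified Klein's bound slightly). Here $\sigma_P^2(\mathcal{F}) := \sup_{f \in \mathcal{F}} (Pf^2 - (Pf)^2)$.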
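
To get a feel for the size of the deviation terms, the following sketch evaluates the right-hand side inside Bousquet's bound, assuming it has the form $\mathbb{E}\|P_n - P\|_{\mathcal{F}} + \sqrt{2(t/n)(\sigma_P^2(\mathcal{F}) + 2\mathbb{E}\|P_n - P\|_{\mathcal{F}})} + t/(3n)$. The function name `bousquet_upper` and the numerical inputs are hypothetical; in practice $\mathbb{E}\|P_n - P\|_{\mathcal{F}}$ and $\sigma_P^2(\mathcal{F})$ are unknown and have to be bounded separately.

```python
import math

def bousquet_upper(mean_sup, sigma2, n, t):
    """Level exceeded by ||P_n - P||_F with probability at most exp(-t).

    mean_sup : assumed value of E||P_n - P||_F
    sigma2   : assumed value of sup_f Var_P(f)
    """
    deviation = math.sqrt(2.0 * t / n * (sigma2 + 2.0 * mean_sup))
    return mean_sup + deviation + t / (3.0 * n)

# Hypothetical inputs: the variance-driven term dominates for moderate t
level_small_var = bousquet_upper(mean_sup=0.01, sigma2=0.001, n=1000, t=3.0)
level_large_var = bousquet_upper(mean_sup=0.01, sigma2=0.25, n=1000, t=3.0)
```

The comparison of the two levels shows why localization helps: restricting attention to functions with small variance shrinks the square-root term.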

2.2. Empirical and Rademacher processes. The empirical process is
commonly defined as $n^{1/2}(P_n - P)$ and it is most often viewed as a stochastic
process indexed by a function class $\mathcal{F}$: $n^{1/2}(P_n - P)(f)$, $f \in \mathcal{F}$ (see [16]
or [47]). The Rademacher process indexed by a class $\mathcal{F}$ is defined as

$$R_n(f) := n^{-1} \sum_{i=1}^{n} \varepsilon_i f(X_i), \qquad f \in \mathcal{F},$$

$\{\varepsilon_i\}$ being i.i.d. Rademacher random variables (i.e., $\varepsilon_i$ takes the values $+1$
and $-1$ with probability 1/2 each) independent of $\{X_i\}$. Roughly, $R_n(f)$
is the value of the empirical correlation coefficient between $f(X_i)$, $i = 1, \dots, n$,
and Rademacher random noise. If $\|R_n\|_{\mathcal{F}}$ is large, it means that there exists
$f \in \mathcal{F}$ for which $f(X_i)$ fits the noise well. Using such a class $\mathcal{F}$ in empirical
risk minimization is likely to result in overfitting, which provides an intuitive
explanation of the role of $\|R_n\|_{\mathcal{F}}$ as a complexity penalty in empirical risk
minimization problems.
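
The interpretation of $\|R_n\|_{\mathcal{F}}$ as a measure of how well the class can fit random signs can be checked on a toy example. The sketch below is a minimal illustration for a finite class; the data, the threshold functions $f_c(x) = 1\{x \le c\}$ and the helper names are assumptions made for the example and do not come from the paper.

```python
import random

def rademacher_sup_norm(values, eps):
    """||R_n||_F for a finite class: values[k][i] = f_k(X_i), eps[i] in {-1, +1}."""
    n = len(eps)
    return max(abs(sum(e * v for e, v in zip(eps, row))) / n for row in values)

def expected_rademacher(values, n_draws=200, seed=0):
    """Monte Carlo estimate of E_eps ||R_n||_F (expectation over the signs only)."""
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += rademacher_sup_norm(values, eps)
    return total / n_draws

# Hypothetical sample and classes of threshold functions f_c(x) = 1{x <= c}
data_rng = random.Random(1)
xs = [data_rng.random() for _ in range(200)]
small = [[1.0 if x <= 0.5 else 0.0 for x in xs]]                          # one function
large = [[1.0 if x <= c / 50 else 0.0 for x in xs] for c in range(1, 51)]  # fifty functions
r_small = expected_rademacher(small)
r_large = expected_rademacher(large)
```

Because the larger class contains the single function of the smaller one and both estimates use the same sign draws, `r_large` is at least `r_small`: a richer class correlates better with pure noise.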

Rademacher processes have been widely used in the theory of empirical 
processes because of the following important inequality: 

$$\frac{1}{2}\mathbb{E}\|R_n\|_{\mathcal{F}_c} \le \mathbb{E}\|P_n - P\|_{\mathcal{F}} \le 2\mathbb{E}\|R_n\|_{\mathcal{F}},$$

where $\mathcal{F}_c := \{f - Pf : f \in \mathcal{F}\}$. The upper bound is often referred to as
a symmetrization inequality and the lower bound as a desymmetrization
inequality. We will use this terminology in the future. These inequalities
were brought into the theory of empirical processes by Giné and Zinn [21].
It is often convenient to use the desymmetrization inequality in combination
with the following elementary lower bound:



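$$\mathbb{E}\|R_n\|_{\mathcal{F}_c} \ge \mathbb{E}\|R_n\|_{\mathcal{F}} - \sup_{f \in \mathcal{F}} |Pf| \, \mathbb{E}|R_n(1)|$$

$$\ge \mathbb{E}\|R_n\|_{\mathcal{F}} - \sup_{f \in \mathcal{F}} |Pf| \, \mathbb{E}^{1/2}\left| n^{-1} \sum_{i=1}^{n} \varepsilon_i \right|^2$$

$$\ge \mathbb{E}\|R_n\|_{\mathcal{F}} - \frac{\sup_{f \in \mathcal{F}} |Pf|}{\sqrt{n}}.$$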



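Rademacher processes possess many remarkable properties. In particular,
they satisfy the following beautiful contraction inequality: if $\mathcal{F}$ is a class
of functions with values in $[-1, 1]$, $\varphi$ is a function on $[-1, 1]$ with $\varphi(0) = 0$
and of Lipschitz norm bounded by 1, and $\varphi \circ \mathcal{F} := \{\varphi \circ f : f \in \mathcal{F}\}$, then
$\mathbb{E}\|R_n\|_{\varphi \circ \mathcal{F}} \le 2\mathbb{E}\|R_n\|_{\mathcal{F}}$ (follows from [31], Theorem 4.12). This implies, for
instance, that

$$\mathbb{E} \sup_{f \in \mathcal{F}} \left| n^{-1} \sum_{i=1}^{n} \varepsilon_i f^2(X_i) \right| \le 4\, \mathbb{E} \sup_{f \in \mathcal{F}} \left| n^{-1} \sum_{i=1}^{n} \varepsilon_i f(X_i) \right|.$$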



Concentration inequalities also apply to the Rademacher process since it
can be viewed as an empirical process based on the sample $(X_1, \varepsilon_1), \dots, (X_n, \varepsilon_n)$.

Often one needs to bound expected suprema of empirical and Rademacher
processes. This can be done using various types of covering numbers (such as
uniform covering numbers, random covering numbers, bracketing numbers,
etc.) and the corresponding Dudley's entropy integrals. For instance, let
$N(\mathcal{F}; L_2(P_n); \varepsilon)$ denote the minimal number of $L_2(P_n)$-balls of radius $\varepsilon$
covering $\mathcal{F}$. Suppose that $\forall f \in \mathcal{F}$, $\forall x \in S$: $|f(x)| \le F(x) \le U$, where $U > 0$ and
$F$ is a measurable function (called an envelope of $\mathcal{F}$). Let $\sigma^2 := \sup_{f \in \mathcal{F}} Pf^2$.
If for some $A > 0$, $V > 0$




then with some universal constant $C > 0$ (for $\sigma^2 \ge \mathrm{const}\, n^{-1}$)




If for some $A > 0$, $\rho \in (0, 1)$




then 




The proofs of these types of bounds can be found in [17, 18, 20, 37, 41] (the
current version of (2.4) is due to Giné and Koltchinskii [19]).

In particular, if $\mathcal{F}$ is a VC-subgraph class, then the condition (2.1) holds
(in fact, the condition holds even for the uniform covering numbers) and one
can use the bound (2.2). We will call the function classes satisfying (2.1) VC-type
classes. If $\mathcal{H}$ is VC-type, then its convex hull $\mathrm{conv}(\mathcal{H})$ satisfies (2.3) with
$\rho := \frac{V}{V+2}$ (see [47]), so one can use the bound (2.4) for $\mathcal{F} \subset \mathrm{conv}(\mathcal{H})$ (note
that one should use the envelope $F$ of the class $\mathcal{H}$ itself for its convex hull
as well). Many other useful bounds on expected suprema of empirical and
Rademacher processes (in particular, in terms of bracketing numbers) can
be found in [47] and [16].

2.3. The $\sharp$-transform and related questions. In this section, we introduce
and discuss some useful transformations involved in the definitions of various
complexity measures of function classes in empirical risk minimization.
As has already been pointed out in the Introduction, the excess risk bounds
are often based on solving the fixed point equation, or, more generally,
equations of the type $\psi(\delta) = \varepsilon\delta$, for $\psi = U_n(\cdot;t)$. This naturally leads to the
following definitions.

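For a function $\psi : \mathbb{R}_+ \mapsto \mathbb{R}_+$, define

$$\psi^{\flat}(\delta) := \sup_{\sigma \ge \delta} \frac{\psi(\sigma)}{\sigma} \quad \text{and} \quad \psi^{\sharp}(\varepsilon) := \inf\{\delta > 0 : \psi^{\flat}(\delta) \le \varepsilon\}.$$
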
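We will call these transformations, respectively, the $\flat$-transform and the
$\sharp$-transform of $\psi$. We are mainly interested in the $\sharp$-transform. It has the
following properties whose proofs are elementary and straightforward:

1. Suppose that $\psi(u) = o(u)$ as $u \to \infty$. Then the function $\psi^{\sharp}$ is defined
on $(0, +\infty)$ and is a nonincreasing function on this interval.

2. If $\psi_1 \le \psi_2$, then $\psi_1^{\sharp} \le \psi_2^{\sharp}$. Moreover, it is enough to assume that $\psi_1(\delta) \le \psi_2(\delta)$
either for all $\delta \ge \psi_2^{\sharp}(\varepsilon)$, or for all $\delta \ge \psi_1^{\sharp}(\varepsilon) - \tau$ with an arbitrary $\tau > 0$,
to conclude that $\psi_1^{\sharp}(\varepsilon) \le \psi_2^{\sharp}(\varepsilon)$.

3. For $a > 0$, $(a\psi)^{\sharp}(\varepsilon) = \psi^{\sharp}(\varepsilon/a)$.

4. If $\varepsilon = \varepsilon_1 + \cdots + \varepsilon_m$, then

$$\psi_1^{\sharp}(\varepsilon) \vee \cdots \vee \psi_m^{\sharp}(\varepsilon) \le (\psi_1 + \cdots + \psi_m)^{\sharp}(\varepsilon) \le \psi_1^{\sharp}(\varepsilon_1) \vee \cdots \vee \psi_m^{\sharp}(\varepsilon_m).$$

5. If $\psi(u) \equiv c$, then $\psi^{\sharp}(\varepsilon) = c/\varepsilon$.

6. If $\psi(u) := u^{\alpha}$ with $\alpha < 1$, then $\psi^{\sharp}(\varepsilon) := \varepsilon^{-1/(1-\alpha)}$.

7. For $c > 0$, let $\psi_c(\delta) := \psi(c\delta)$. Then $\psi_c^{\sharp}(\varepsilon) = \frac{1}{c}\psi^{\sharp}(\varepsilon/c)$. If $\psi$ is
nondecreasing and $c \ge 1$, then this easily implies that $c\psi^{\sharp}(u) \le \psi^{\sharp}(u/c)$.

8. For $c > 0$, let now $\psi_c(\delta) := \psi(\delta + c)$. Then for all $u > 0$, $\varepsilon \in (0, 1]$,
$\psi_c^{\sharp}(u) \le (\psi^{\sharp}(\varepsilon u/2) - c) \vee c\varepsilon$.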

Let us call $\psi : \mathbb{R}_+ \mapsto \mathbb{R}_+$ a function of concave type if it is nondecreasing
and $u \mapsto \frac{\psi(u)}{u}$ is decreasing. If, in addition, for some $\gamma \in (0, 1)$, $u \mapsto \frac{\psi(u)}{u^{\gamma}}$ is
decreasing, $\psi$ will be called a function of strictly concave type (with
exponent $\gamma$). In particular, if $\psi(u) := \varphi(u^{\gamma})$, or $\psi(u) := \varphi^{\gamma}(u)$, where $\varphi$ is a
nondecreasing strictly concave function with $\varphi(0) = 0$, then $\psi$ is of concave
type for $\gamma = 1$ and of strictly concave type for $\gamma < 1$.

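9. If $\psi$ is of concave type, then $\psi^{\sharp}$ is the inverse of the function $\delta \mapsto \frac{\psi(\delta)}{\delta}$.
In this case, $\psi^{\sharp}(cu) \ge \psi^{\sharp}(u)/c$ for $c \le 1$ and $\psi^{\sharp}(cu) \le \psi^{\sharp}(u)/c$ for $c \ge 1$.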
10. If $\psi$ is of strictly concave type with exponent $\gamma$, then for $c \le 1$,
$\psi^{\sharp}(cu) \le \psi^{\sharp}(u)\, c^{-1/(1-\gamma)}$.

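It will be convenient sometimes to discretize the supremum in the definition
of $\psi^{\flat}$. Namely, let $q > 1$ and $\delta_j := q^{-j}$, $j \in \mathbb{Z}$. Define

$$\psi^{\flat,q}(\delta) := \sup_{\delta_j \ge \delta} \frac{\psi(\delta_j)}{\delta_j}, \qquad \psi^{\sharp,q}(\varepsilon) := \inf\{\delta > 0 : \psi^{\flat,q}(\delta) \le \varepsilon\}$$

and

$$\psi^{\flat,q}_{(0,1]}(\delta) := \sup_{1 \ge \delta_j \ge \delta} \frac{\psi(\delta_j)}{\delta_j}, \qquad \psi^{\sharp,q}_{(0,1]}(\varepsilon) := \inf\{\delta \in (0, 1] : \psi^{\flat,q}_{(0,1]}(\delta) \le \varepsilon\}$$

(if in the last definition $\psi^{\flat,q}_{(0,1]}(\delta)$ is larger than $\varepsilon$ for all $\delta \le 1$, then we set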

Properties 1-4 and 7 hold for $\psi^{\sharp,q}$ with the following obvious changes. In
property 2, it is enough to assume that $\psi_1(\delta) \le \psi_2(\delta)$ only for $\delta = \delta_j$ and the
second part of this property should be formulated as follows: if $\psi_1(\delta) \le \psi_2(\delta)$
either for all $\delta \ge \psi_2^{\sharp,q}(\varepsilon)$, or for all $\delta \ge q^{-1}\psi_1^{\sharp,q}(\varepsilon)$, then $\psi_1^{\sharp,q}(\varepsilon) \le \psi_2^{\sharp,q}(\varepsilon)$.
Property 7 holds with $c = q^{j}$ for any $j$. We will refer to these properties as
1'-4' and 7' in what follows.

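Also, the following simple fact is true:

11. If $\psi$ is nondecreasing, then $\psi^{\sharp,q}_{(0,1]}(\varepsilon) \le \psi^{\sharp,q}(\varepsilon) \le \psi^{\sharp}(\varepsilon) \le \psi^{\sharp,q}(\varepsilon/q)$. In
addition, if $\psi(\delta) = \mathrm{const}$ for $\delta \ge 1$ (which will be the case in many situations),
then $\psi^{\sharp,q}_{(0,1]}(\varepsilon) = \psi^{\sharp,q}(\varepsilon)$.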
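
The transforms of this section are easy to approximate numerically, which is a convenient way to sanity-check the properties listed above. The following is a minimal sketch, assuming the sup-over-$\sigma \ge \delta$ of $\psi(\sigma)/\sigma$ and the inf over $\delta$ are both restricted to a finite increasing grid; the function name `sharp_transform` and the grid are illustrative choices, and the result is accurate only up to the grid spacing and range.

```python
import math

def sharp_transform(psi, eps, grid):
    """Grid approximation of the sharp-transform of psi at level eps.

    grid must be sorted increasingly; the flat-transform at grid[i] is
    approximated by the maximum of psi(s)/s over grid points s >= grid[i].
    """
    ratios = [psi(s) / s for s in grid]
    flat = ratios[:]                          # suffix maxima of the ratios
    for i in range(len(flat) - 2, -1, -1):
        flat[i] = max(flat[i], flat[i + 1])
    for delta, value in zip(grid, flat):      # smallest delta with flat <= eps
        if value <= eps:
            return delta
    return math.inf

# Geometric grid covering [1e-6, 1e6] with about 1.2% spacing
grid = [10.0 ** (k / 200.0) for k in range(-1200, 1201)]
const_sharp = sharp_transform(lambda u: 2.0, 0.5, grid)       # compare with c/eps = 4
power_sharp = sharp_transform(lambda u: u ** 0.5, 0.1, grid)  # compare with eps**(-2) = 100
```

Up to the grid resolution, the two values agree with properties 5 and 6, for a constant function with $c = 2$, $\varepsilon = 0.5$ and for $\psi(u) = u^{1/2}$ with $\varepsilon = 0.1$.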

We conclude this section with a simple proposition, describing useful prop- 
erties of functions of strictly concave type. 

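Proposition 1. (i) If $\psi$ is a function of strictly concave type with some
exponent $\gamma \in (0, 1)$, then

$$\sum_{j : \delta_j \ge \delta} \frac{\psi(\delta_j)}{\delta_j} \le c_{\gamma,q} \frac{\psi(\delta)}{\delta},$$

where $c_{\gamma,q}$ is a constant depending only on $q, \gamma$.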

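(ii) Under the same assumptions, the equation $\psi(\delta) = \delta$ has unique solution
$\bar\delta$. Suppose $\bar\delta < 1$ and define $\delta_0 := 1$, $\delta_{k+1} := \psi(\delta_k) \wedge 1$. Then $\{\delta_k\}$ is a
nonincreasing sequence converging to $\bar\delta$ and, for all $k$, $\delta_k - \bar\delta \le \bar\delta^{\,1-\gamma^k}(1 - \bar\delta)^{\gamma^k}$.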

2.4. Empirical and Rademacher complexities. The most natural complexity
penalties in risk minimization problems are based on expected sup-norms
of the empirical process over the whole class $\mathcal{F}$ or its subsets. However,
such complexities are distribution dependent, so it is hard to use them
in model selection. The idea to use Rademacher processes to construct
data-dependent complexity penalties in model selection problems of learning
theory was suggested independently by Koltchinskii [26] and Bartlett,
Boucheron and Lugosi [4]. It is based on the following simple observation:
if one combines the symmetrization inequality with concentration
inequalities for empirical and Rademacher processes (in fact, with simpler
Hoeffding-type concentration inequalities based on the martingale difference
approach), one can get the following bound:

$$\mathbb{P}\left\{ \|P_n - P\|_{\mathcal{F}} \ge 2\|R_n\|_{\mathcal{F}} + \frac{3t}{\sqrt{n}} \right\} \le \exp\left\{ -\frac{2t^2}{3} \right\}, \qquad t > 0.$$



Quite similarly, using instead the desymmetrization inequality one can get
a simple lower confidence bound on $\|P_n - P\|_{\mathcal{F}}$ in terms of $\|R_n\|_{\mathcal{F}}$. Since
the Rademacher process does not involve the unknown distribution directly
and can be computed based only on the data, one can use $\|R_n\|_{\mathcal{F}}$ as a
data-dependent measure of the accuracy of approximation of the true distribution
$P$ by the empirical distribution $P_n$ uniformly over the class. Essentially, this
justifies using $\|R_n\|_{\mathcal{F}}$ as a bootstrap-type complexity penalty associated with
the class $\mathcal{F}$ (although Rademacher bootstrap is not asymptotically correct).
The main problem, however, is that such global complexities as $\|R_n\|_{\mathcal{F}}$ do not
allow one to recover the convergence rates in risk minimization problems.
Typically, $\|R_n\|_{\mathcal{F}}$ would be of the order $O(n^{-1/2})$ (this is the case, e.g.,
for VC-classes and, more generally, for Donsker classes of functions). The
convergence rates in many risk minimization problems are often faster than
this and they are related to the behavior of the continuity modulus of the
empirical process $n^{1/2}(P_n - P)$ rather than to the behavior of its sup-norm
(see [36]). Thus, relevant data-dependent complexities could be based on the
continuity modulus of the Rademacher process that mimics the properties
of the empirical process. As we will see later, the complexities of this type
are defined as the $\sharp$-transform of the corresponding (expected) continuity
modulus.

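Let $\rho_P : L_2(P) \times L_2(P) \mapsto [0, +\infty)$ be a function such that

$$\rho_P^2(f, g) \ge P(f - g)^2 - (P(f - g))^2, \qquad f, g \in L_2(P).$$

Typically $\rho_P$ will also be a (pseudo)metric, for instance, $\rho_P^2(f, g) = P(f - g)^2$
or $\rho_P^2(f, g) = P(f - g)^2 - (P(f - g))^2$.

Given a function $Y : \mathcal{F} \mapsto \mathbb{R}$, define its continuity moduli (local and global)
as follows:

$$\omega_{\rho_P}(Y; f; \delta) := \sup_{g \in \mathcal{F},\ \rho_P(g, f) \le \delta} |Y(g) - Y(f)| \quad \text{and}$$

$$\omega_{\rho_P}(Y; \delta) := \sup_{f, g \in \mathcal{F},\ \rho_P(f, g) \le \delta} |Y(f) - Y(g)|.$$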

Assume, for simplicity, that the infimum of $Pf$ over $\mathcal{F}$ is attained at a
function $\bar f \in \mathcal{F}$ (we are assuming this in what follows whenever it is needed;
otherwise, the definitions can be easily modified). Let

$$\theta_n(\delta) := \theta_n(\mathcal{F}; \bar f; \delta) := \mathbb{E}\, \omega_{\rho_P}(P_n - P; \bar f; \sqrt{\delta}).$$

The empirical complexities, such as the ones previously used in [5, 14, 27, 36],
can now be defined as $\theta_n^{\sharp}(\varepsilon)$ where $\varepsilon$ is a numerical constant (often, $\varepsilon = 1$,
which corresponds to the fixed point equation, but sometimes the dependence
on $\varepsilon$ is of importance). The function $\theta_n(\delta)$ in this definition can be
replaced by $\sup_{f \in \mathcal{F}} \mathbb{E}\, \omega_{\rho_P}(P_n - P; f; \sqrt{\delta})$, or even by $\mathbb{E}\, \omega_{\rho_P}(P_n - P; \sqrt{\delta})$,
without increasing the complexity significantly (at least, in most of the relevant
examples).

It will be shown in the next sections how to use these types of quantities
to provide upper bounds on the excess risk. Now, we utilize the Rademacher
process to construct data-dependent bounds on $\theta_n^{\sharp}(\varepsilon)$. Suppose that
$\rho_P^2(f, g) := P(f - g)^2$. Define

$$\omega_n(\delta) := \mathbb{E}\, \omega_{\rho_P}(R_n; \sqrt{\delta}), \qquad \hat\omega_n(\delta) := \omega_{\rho_{P_n}}(R_n; \sqrt{\delta}), \qquad \hat\omega_{n,r}(\delta) := \mathbb{E}_{\varepsilon}\, \omega_{\rho_{P_n}}(R_n; \sqrt{\delta}),$$

where $\mathbb{E}_{\varepsilon}$ denotes the expectation only with respect to the Rademacher
sequence $\{\varepsilon_i\}$.

The next lemma is close to some statements in [5]. Koltchinskii
and Panchenko [27] proved some results in this direction in a more
specialized setting of function learning (in the zero error case). We give its proof
in Section 9 for completeness and also because a similar approach is used in
the proofs of several other results given below.

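Lemma 1. For $q > 1$, there exist constants $C, c > 0$ (depending only on $q$)
such that

$$\forall \varepsilon > 0 \qquad \theta_n^{\sharp}(\varepsilon) \le \omega_n^{\sharp}(\varepsilon/2)$$

and for all $\varepsilon \in (0, 1]$

$$\mathbb{P}\left\{ \omega_n^{\sharp}(\varepsilon) \ge C\left( \hat\omega_n^{\sharp}(c\varepsilon) + \frac{t}{n\varepsilon^2} \right) \right\} \le 2\log_q \frac{q}{\varepsilon}\, e^{-t},$$

$$\mathbb{P}\left\{ \hat\omega_n^{\sharp}(\varepsilon) \ge C\left( \omega_n^{\sharp}(c\varepsilon) + \frac{t}{n\varepsilon^2} \right) \right\} \le 2\log_q \frac{q}{\varepsilon}\, e^{-t}.$$

The same is true with $\hat\omega_n$ replaced by $\hat\omega_{n,r}$.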

2.5. Examples. We give below several simple bounds on local Rademacher
complexities $\theta_n^{\sharp}(\varepsilon)$, $\varepsilon \in (0, 1]$, that are of interest in applications and have
been discussed, for example, in [5, 6, 10, 36].

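Example 1 (Finite-dimensional classes). Suppose that $\mathcal{F}$ is a subset of
a finite-dimensional subspace $L$ of $L_2(P)$ with $\dim(L) = d$. Then $\theta_n(\delta) \le (\delta d/n)^{1/2}$
and $\theta_n^{\sharp}(\varepsilon) \le d/(n\varepsilon^2)$. Indeed, if $e_1, \dots, e_d$ is an orthonormal basis
of $L$, and $g, \bar g \in L$, $g = \sum_{i=1}^{d} \alpha_i e_i$, $\bar g = \sum_{i=1}^{d} \bar\alpha_i e_i$, then
$\|g - \bar g\|^2_{L_2(P)} = \sum_{i=1}^{d} (\alpha_i - \bar\alpha_i)^2$. Therefore, using the Cauchy-Schwarz inequality,

$$\theta_n(\delta) = \mathbb{E} \sup_{g \in \mathcal{F},\ \|g - \bar g\|_{L_2(P)} \le \sqrt{\delta}} |(P_n - P)(g - \bar g)|$$

$$\le \mathbb{E} \sup_{\sum_{i=1}^{d} (\alpha_i - \bar\alpha_i)^2 \le \delta} \left| \sum_{i=1}^{d} (\alpha_i - \bar\alpha_i)(P_n - P)(e_i) \right|$$

$$\le \sqrt{\delta}\, \mathbb{E}\left( \sum_{i=1}^{d} (P_n - P)^2(e_i) \right)^{1/2} \le \sqrt{\frac{\delta d}{n}},$$

and the second bound on $\theta_n^{\sharp}(\varepsilon)$ is now immediate due to the properties of the
$\sharp$-transform.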
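
For completeness, the last step of Example 1 can be written out. This is a short worked computation using only the definition of the $\sharp$-transform and its monotonicity (property 2 of Section 2.3), with $\psi(\delta) := (\delta d/n)^{1/2}$ as the majorant of $\theta_n$; the symbol $\psi$ is introduced here only for the computation.

```latex
\[
\frac{\psi(\sigma)}{\sigma} = \sqrt{\frac{d}{n\sigma}}
\ \text{ is decreasing in } \sigma,
\quad\text{so}\quad
\psi^{\flat}(\delta) = \sqrt{\frac{d}{n\delta}} .
\]
\[
\psi^{\flat}(\delta) \le \varepsilon
\iff
\delta \ge \frac{d}{n\varepsilon^{2}},
\qquad\text{hence}\qquad
\theta_n^{\sharp}(\varepsilon) \le \psi^{\sharp}(\varepsilon) = \frac{d}{n\varepsilon^{2}} .
\]
```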
Example 2 (Ellipsoids in $L_2$). This is a simple generalization of the
previous example. Suppose that $\mathcal{F} := \{Tg : \|g\|_{L_2(P)} \le 1\}$, where $T : L_2(P) \mapsto L_2(P)$
is a Hilbert-Schmidt operator with Hilbert-Schmidt norm $\|T\|_{HS}$
and such that its operator norm $\|T\| \le 1$. Thus, $\mathcal{F}$ is an ellipsoid in the Hilbert
space $L_2(P)$. Suppose also that $\mathrm{Ker}(T) = \{0\}$, and, for $f_1 = Tg_1$, $f_2 = Tg_2$,
we define $\rho_P(f_1, f_2) = \|g_1 - g_2\|_{L_2(P)}$. Then, the same argument as in the
previous example yields $\theta_n(\delta) \le (\delta\|T\|^2_{HS}/n)^{1/2}$ and $\theta_n^{\sharp}(\varepsilon) \le \|T\|^2_{HS}/(n\varepsilon^2)$.

Often, it is natural to use Dudley's entropy integral to bound the function
$\theta_n(\delta)$ and then to derive a bound on $\theta_n^{\sharp}(\varepsilon)$. Various notions of the entropy
of the function class $\mathcal{F}$ can be used for this purpose (entropy with bracketing,
random entropy, uniform entropy, etc.). This technique is standard in the
theory of empirical processes and can be found, for example, in the book
of van der Vaart and Wellner [47]. Here are some examples of the bounds
based on this approach.

Example 3 (VC-type classes). Suppose that $\mathcal{F}$ is a VC-type class, that
is, the condition (2.1) is satisfied (in particular, $\mathcal{F}$ might be a VC-subgraph
class). Assume for simplicity that $F \equiv U = 1$. Then it follows from (2.2) that



which leads to the following bound: $\theta_n^{\sharp}(\varepsilon) \le \frac{CV}{n\varepsilon^2} \log\frac{n\varepsilon^2}{V}$.

Example 4 (Entropy conditions). In the case when the entropy of the
class (uniform, bracketing, etc.) is bounded by $O(\varepsilon^{-2\rho})$ for some $\rho \in (0, 1)$,
we typically have $\theta_n^{\sharp}(\varepsilon) = O(n^{-1/(1+\rho)})$. For instance, if (2.3) holds, then it
follows from (2.4) (with $F \equiv U = 1$ for simplicity) that



Therefore, $\theta_n^{\sharp}(\varepsilon) \le C A^{2\rho/(1+\rho)} / (n\varepsilon^2)^{1/(1+\rho)}$.

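Example 5 (Convex hulls). If $\mathcal{F} := \mathrm{conv}(\mathcal{H}) := \{\sum_j \lambda_j h_j : \sum_j |\lambda_j| \le 1,\ h_j \in \mathcal{H}\}$
is the symmetric convex hull of a given VC-type class $\mathcal{H}$ of measurable
functions from $S$ into $[0, 1]$, then the condition of the previous example is
satisfied with $\rho := \frac{V}{V+2}$. This yields $\theta_n^{\sharp}(\varepsilon) \le \left( \frac{K(V)}{n\varepsilon^2} \right)^{\frac{1}{2}\frac{2+V}{1+V}}$.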

Example 6 (Shattering numbers for classes of binary functions). Let $\mathcal{F}$
be a class of binary functions, that is, functions $f : S \mapsto \{0, 1\}$. Let

$$\Delta^{\mathcal{F}}(X_1, \dots, X_n) := \mathrm{card}(\{(f(X_1), \dots, f(X_n)) : f \in \mathcal{F}\})$$




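be the shattering number of the class $\mathcal{F}$ on the sample $(X_1, \dots, X_n)$. Using
a bound that can be found in [36], we get

$$\theta_n(\delta) \le K\left[ \sqrt{\frac{\delta\, \mathbb{E}\log\Delta^{\mathcal{F}}(X_1, \dots, X_n)}{n}} + \frac{\mathbb{E}\log\Delta^{\mathcal{F}}(X_1, \dots, X_n)}{n} \right],$$

which easily yields

$$\theta_n^{\sharp}(\varepsilon) \le \frac{C\, \mathbb{E}\log\Delta^{\mathcal{F}}(X_1, \dots, X_n)}{n\varepsilon^2}.$$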



Example 7 (Mendelson's complexities for kernel machines). Let $K$ be
a symmetric nonnegatively definite kernel on $S \times S$ and let $H_K$ be the
corresponding reproducing kernel Hilbert space, that is, $H_K$ is the closure
of the set of linear combinations $\sum_i \alpha_i K(x_i, \cdot)$, $x_i \in S$, $\alpha_i \in \mathbb{R}$, with respect
to the norm $\|\cdot\|_K$ defined as

$$\left\| \sum_i \alpha_i K(x_i, \cdot) \right\|_K^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j).$$

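Suppose that $\mathcal{F} := B_K$ is the unit ball in $H_K$. Such classes are frequently
used in learning theory for kernel machines. Let $\lambda_i$ be the eigenvalues of the
integral operator generated by $K$ in the space $L_2(P)$. The following is a version
of bounds of Mendelson [37]:

$$C_1\left( n^{-1} \sum_{i=1}^{\infty} \lambda_i \wedge \delta \right)^{1/2} \le \omega_n(\delta) = \mathbb{E} \sup_{P(f-g)^2 \le \delta,\ f, g \in \mathcal{F}} |R_n(f - g)| \le C_2\left( n^{-1} \sum_{i=1}^{\infty} \lambda_i \wedge \delta \right)^{1/2}$$

with some numerical constants $C_1, C_2 > 0$. Similarly, if $\lambda_i^{(n)}$, $i = 1, \dots, n$, are
the eigenvalues of the matrix $(n^{-1}K(X_i, X_j) : 1 \le i, j \le n)$, then Mendelson's
argument also gives

$$C_1\left( n^{-1} \sum_{i=1}^{n} \lambda_i^{(n)} \wedge \delta \right)^{1/2} \le \hat\omega_{n,r}(\delta) = \mathbb{E}_{\varepsilon} \sup_{P_n(f-g)^2 \le \delta,\ f, g \in \mathcal{F}} |R_n(f - g)| \le C_2\left( n^{-1} \sum_{i=1}^{n} \lambda_i^{(n)} \wedge \delta \right)^{1/2}.$$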



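Denote the true and empirical Mendelson's complexities by

$$\gamma_n(\delta) = \gamma_n(\mathcal{F}; \delta) := \left( n^{-1} \sum_{i=1}^{\infty} \lambda_i \wedge \delta \right)^{1/2} \quad \text{and} \quad \hat\gamma_n(\delta) = \hat\gamma_n(\mathcal{F}; \delta) := \left( n^{-1} \sum_{i=1}^{n} \lambda_i^{(n)} \wedge \delta \right)^{1/2}.$$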

Note that these functions are strictly concave, nondecreasing and are equal
to $0$ for $\delta = 0$. Moreover, they are both square roots of concave functions
and, hence, they are of strictly concave type. The properties of the $\sharp$-transform
imply that with some constants $c_1, c_2$

$$\gamma_n^{\sharp}(c_1\varepsilon) \le \omega_n^{\sharp}(\varepsilon) \le \gamma_n^{\sharp}(c_2\varepsilon) \quad \text{and} \quad \hat\gamma_n^{\sharp}(c_1\varepsilon) \le \hat\omega_{n,r}^{\sharp}(\varepsilon) \le \hat\gamma_n^{\sharp}(c_2\varepsilon).$$

Together with Lemma 1, this allows one to use empirical Mendelson's
complexity as an estimate of true Mendelson's complexity.
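
A complexity of Mendelson's type is straightforward to evaluate once a spectrum is available. The sketch below is a hedged illustration: it takes a list of eigenvalues as given (the decay $\lambda_i = i^{-2}$ and the truncation at 1000 terms are hypothetical), evaluates $\gamma(\delta) = (n^{-1}\sum_i \lambda_i \wedge \delta)^{1/2}$, and finds the level where $\gamma(\delta)/\delta = \varepsilon$ by bisection on a logarithmic scale, relying on the fact that for a function of concave type the $\sharp$-transform is the inverse of $\delta \mapsto \gamma(\delta)/\delta$. The helper names are not from the paper.

```python
import math

def mendelson_complexity(eigenvalues, delta, n):
    """gamma(delta) = sqrt( (1/n) * sum_i min(lambda_i, delta) )."""
    return math.sqrt(sum(min(lam, delta) for lam in eigenvalues) / n)

def sharp_by_bisection(gamma, eps, lo=1e-12, hi=1e6, iters=200):
    """Solve gamma(delta)/delta = eps, assuming delta -> gamma(delta)/delta
    is decreasing and crosses eps somewhere between lo and hi."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # midpoint on a log scale
        if gamma(mid) / mid > eps:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical spectrum lambda_i = i^(-2), truncated at 1000 terms
n = 500
lams = [i ** -2.0 for i in range(1, 1001)]

def gamma(delta):
    return mendelson_complexity(lams, delta, n)

delta_star = sharp_by_bisection(gamma, eps=1.0)   # fixed point gamma(delta) = delta
```

Replacing `lams` by the eigenvalues of the normalized kernel matrix $(n^{-1}K(X_i, X_j))$ would give the empirical version $\hat\gamma_n$ in the same way; computing those eigenvalues is outside the scope of this sketch.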

3. First excess risk bounds. The idea of expressing excess risk bounds in
terms of solutions of fixed point equations for the continuity modulus of
empirical or Rademacher processes, and of relating them to ratio-type
inequalities, has been around for a while (see [5, 27, 36]). Compared with the
recent work of Bartlett, Bousquet and Mendelson [5], our approach in this
section relates the excess risk bounds more directly to the diameter of the
$\delta$-minimal set of $P$ (recall the definitions in Section 1) and also provides
ratio-type inequalities for the empirical excess risk expressed in terms of the
$\flat$-transform of the function $U_n(\delta;t)$ involved in Talagrand's inequality. The
excess empirical risk is defined as $\hat{\mathcal{E}}_n(f) := \mathcal{E}_{P_n}(f)$ and the $\delta$-minimal set
of $P_n$ as $\hat{\mathcal{F}}_n(\delta) := \mathcal{F}_{P_n}(\delta)$. Also, denote $\mathcal{F}(s,r] := \mathcal{F}_P(s,r] := \mathcal{F}(r)\setminus\mathcal{F}(s)$.

Let $\hat f_n := \operatorname{argmin}_{f\in\mathcal{F}}P_nf$ be an empirical risk minimizer [i.e., a solution
of (1.2)]. For simplicity, we assume that it exists, although the results can
be easily modified for approximate solutions of (1.2). Recall that
$D(\delta) := D_P(\mathcal{F};\delta) := \sup_{f,g\in\mathcal{F}(\delta)}\rho_P(f,g)$ denotes the $\rho_P$-diameter of the
$\delta$-minimal set and also that

$$\phi_n(\delta) := \phi_n(\mathcal{F};P;\delta) := E\sup_{f,g\in\mathcal{F}(\delta)}|(P_n-P)(f-g)|.$$

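Let

$$U_n(\delta;t) := U_{n,t}(\delta) := \phi_n(\delta)+\sqrt{2\frac{t}{n}\big(D^2(\delta)+2\phi_n(\delta)\big)}+\frac{t}{2n}.$$

Finally, let us fix $q > 1$ and define $V_n$ and $\delta_n(t)$ as follows:

$$V_n(\delta;t) := V_{n,t}(\delta) := U_{n,t}^{\flat,q}(\delta)\quad\text{and}\quad\delta_n(t) := U_{n,t}^{\sharp,q}\Big(\frac{1}{2q}\Big).$$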
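
The fixed point $\delta_n(t)$ can be approximated numerically. The sketch below rests on assumptions that are not taken from this section: it treats the $(\flat,q)$-transform as the supremum of $\psi(\delta_j)/\delta_j$ over the geometric grid $\delta_j = q^{-j}$ with $\delta_j\ge\delta$, it returns the smallest grid point at which that supremum does not exceed the threshold, and the particular choices of $\phi_n$ and $D$ are hypothetical stand-ins.

```python
import math

def flat_q(psi, delta, q=2.0, j_max=60):
    """sup of psi(d_j) / d_j over grid points d_j = q**(-j) with d_j >= delta."""
    best = 0.0
    for j in range(j_max + 1):
        d = q ** (-j)
        if d < delta:
            break
        best = max(best, psi(d) / d)
    return best

def sharp_q(psi, eps, q=2.0, j_max=60):
    """Smallest grid point d_j with flat_q(psi, d_j) <= eps, or None if none."""
    best, answer = 0.0, None
    for j in range(j_max + 1):
        d = q ** (-j)
        best = max(best, psi(d) / d)
        if best > eps:
            break
        answer = d
    return answer

def make_U(n, t, phi, D):
    """U_{n,t}(delta) = phi + sqrt(2 (t/n) (D^2 + 2 phi)) + t / (2n)."""
    def U(delta):
        p = phi(delta)
        return p + math.sqrt(2.0 * t / n * (D(delta) ** 2 + 2.0 * p)) + t / (2.0 * n)
    return U

# Hypothetical inputs: phi_n(delta) = sqrt(5 delta / n), D(delta) = sqrt(delta).
n, t, q = 10_000, 3.0, 2.0
U = make_U(n, t, phi=lambda d: math.sqrt(5.0 * d / n), D=lambda d: math.sqrt(d))
delta_n = sharp_q(U, 1.0 / (2.0 * q), q=q)
```

Because the term $t/(2n)$ alone forces $U_{n,t}(\delta)/\delta\ge t/(2n\delta)$, the value returned here can never fall below $t/n$, whatever $\phi_n$ and $D$ are.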

Whenever it is needed, we will write $\delta_n(\mathcal{F};t)$ or $\delta_n(\mathcal{F};P;t)$ to emphasize
the dependence of these types of quantities on the function class and on the
distribution. The following result gives an upper bound on the excess risk of $\hat f_n$
and also provides uniform bounds on the ratios of the empirical excess risk
of a function $f\in\mathcal{F}$ to its true excess risk.
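
Theorem 1. For all $t>0$ and all $\delta\ge\delta_n(t)$,

$$P\{\mathcal{E}(\hat f_n)\ge\delta\}\le\log_q\frac{q}{\delta}\,e^{-t}\quad\text{and}$$

$$P\Big\{\sup_{f\in\mathcal{F},\ \mathcal{E}(f)\ge\delta}\Big|\frac{\hat{\mathcal{E}}_n(f)}{\mathcal{E}(f)}-1\Big|\ge qV_n(\delta;t)\Big\}\le\log_q\frac{q}{\delta}\,e^{-t}.$$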

Almost as in Section 2, define the expected continuity modulus

$$\omega_n(\mathcal{F};\delta) := E\sup_{\rho_P(f,g)\le\delta,\ f,g\in\mathcal{F}}|(P_n-P)(f-g)|.$$

Since $\phi_n(\delta)\le\omega_n(\mathcal{F};D(\delta))$, the behavior of $\phi_n$ can be determined by $\omega_n$
and $D$. If $\mathcal{F}$ is a $P$-Donsker class, then, by asymptotic equicontinuity of
empirical processes,

$$\lim_{\delta\to0}\limsup_{n\to\infty}n^{1/2}\omega_n(\mathcal{F};\delta) = 0.$$

This fact and the definition of $\delta_n(t)$ immediately imply that $\delta_n(t) = o(n^{-1/2})$
as soon as $\mathcal{F}$ is $P$-Donsker and $D(\delta)\to0$ as $\delta\to0$. The last condition is
natural if the risk minimization problem (1.1) has a unique solution. Moreover,
there exists a sequence $t_n\to\infty$ such that $\delta_n(t_n) = o(n^{-1/2})$. Thus, by
Theorem 1, we can conclude that $\mathcal{E}_P(\hat f_n) = o_P(n^{-1/2})$ whenever the empirical
risk minimization occurs over a $P$-Donsker class and $D(\delta)\to0$. This
observation shows that convergence rates of the excess risk faster than $n^{-1/2}$
(that came as a surprise in classification problems in the nonzero error case
several years ago) are, in fact, typical in general empirical risk minimization
over Donsker classes.

In the case when the function $f\mapsto Pf$ has a unique minimum in $\mathcal{F}$ (i.e.,
the minimal set $\mathcal{F}(0)$ consists of precisely one element), the quantity $\delta_n(t)$
often gives the correct (in a minimax sense) convergence rate in risk
minimization problems (see Section 6.1). However, if $\mathcal{F}(0)$ consists of more than
one function, then the diameter $D(\delta)$ of the $\delta$-minimal set becomes bounded
away from 0 and, as a result, $\delta_n(t)$ cannot be smaller than a quantity of the
order $n^{-1/2}$ (and the optimal convergence rate is often better than this, e.g.,
in classification problems). In the next section, we study more subtle
geometric characteristics of the class $\mathcal{F}$ that might be used in such cases to
recover the correct convergence rates.

An important consequence of Theorem 1 is the following lemma that
shows that $\delta$-minimal sets can be estimated by empirical $\delta$-minimal sets
provided that $\delta$ is not too small.

Lemma 2. For all $t > 0$, there exists an event of probability at least
$1-\log_q\frac{q^2n}{t}e^{-t}$ such that on this event $\forall\delta\ge\delta_n(t)$: $\mathcal{F}(\delta)\subset\hat{\mathcal{F}}_n(3\delta/2)$ and



sup 



£(.f) 

Note that, as follows from the definition, $\delta_n(t)\ge\frac{t}{n}$, so the probabilities
in Theorem 1 are, in fact, upper bounded by $\log_q\frac{qn}{t}\exp\{-t\}$ (which depends
neither on the class $\mathcal{F}$ nor on $P$). The logarithmic factor in front of the
exponent, most often, does not spoil the bound since in typical applications
$\delta_n(t)$ is upper bounded by $\delta_n+\frac{t}{n}$, where $\delta_n$ is larger than $\frac{\log\log n}{n}$. Adding
$\log\log n$ to $t$ is enough to eliminate the influence of the logarithm. However,
if $\delta_n = O(n^{-1})$, the logarithmic factor would create a problem. It is good to
know that it can be eliminated under extra conditions on $\phi_n(\delta)$ and $D(\delta)$.
More precisely, assume that $\phi_n(\delta)\le\bar\phi_n(\delta)$ and $D(\delta)\le\bar D(\delta)$, $\delta > 0$, where
$\bar\phi_n$ is a function of strictly concave type with some exponent $\gamma\in(0,1)$ and
$\bar D$ is a concave-type function (see the definitions in Section 2.3). Define

$$\bar U_n(\delta;t) := \bar U_{n,t}(\delta) := \bar K\Big(\bar\phi_n(\delta)+\bar D(\delta)\sqrt{\frac{t}{n}}+\frac{t}{n}\Big)$$

with some numerical constant $\bar K$. Then $\bar U_n(\cdot;t)$ is a concave-type function.
In this case, it is natural to define

$$\bar V_n(\delta;t) := \bar U_{n,t}^{\flat}(\delta) = \frac{\bar U_n(\delta;t)}{\delta}\quad\text{and}\quad\bar\delta_n(t) := \bar U_{n,t}^{\sharp}\Big(\frac{1}{q}\Big).$$



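Theorem 2. There exists a constant $\bar K$ such that for all $t>0$ and for
all $\delta\ge\bar\delta_n(t)$,

$$P\{\mathcal{E}(\hat f_n)\ge\delta\}\le e^{-t}\quad\text{and}\quad
P\Big\{\sup_{f\in\mathcal{F},\ \mathcal{E}(f)\ge\delta}\Big|\frac{\hat{\mathcal{E}}_n(f)}{\mathcal{E}(f)}-1\Big|\ge q\bar V_n(\delta;t)\Big\}\le e^{-t}.$$
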

In what follows we do not use this refinement except in several cases when 
it is really needed. 

Now we outline a way to define the empirical version of $\delta_n(t)$. To this end,
it will be convenient to choose $\rho_P^2(f,g) := P(f-g)^2$. Note that

$$U_n(\delta;t)\le\bar U_n(\delta;t) := \bar U_{n,t}(\delta) := \bar K\Big(\phi_n(\delta)+D(\delta)\sqrt{\frac{t}{n}}+\frac{t}{n}\Big),$$

where $\bar K = 2$. Hence, if we define $\bar\delta_n(t) := \bar U_{n,t}^{\sharp,q}(1/2q^3)$, then it follows from
the definitions that $\delta_n(t)\le\bar\delta_n(t)$.

Define the empirical versions of the functions $D$ and $\phi_n$ as follows:

$$\hat D_n(\delta) := \sup_{f,g\in\hat{\mathcal{F}}_n(\delta)}\rho_{P_n}(f,g)\quad\text{and}\quad
\hat\phi_n(\delta) := \sup_{f,g\in\hat{\mathcal{F}}_n(\delta)}|R_n(f-g)|.$$

Let

$$\hat U_n(\delta;t) := \hat U_{n,t}(\delta) := \hat K\Big(\hat\phi_n(\hat c\delta)+\hat D_n(\hat c\delta)\sqrt{\frac{t}{n}}+\frac{t}{n}\Big),$$

$$\tilde U_n(\delta;t) := \tilde U_{n,t}(\delta) := \tilde K\Big(\phi_n(\tilde c\delta)+D(\tilde c\delta)\sqrt{\frac{t}{n}}+\frac{t}{n}\Big),$$

where $2\le\hat K\le\tilde K$, $\hat c,\tilde c\ge1$ are numerical constants. It happens that $\hat U_n$
is a data-dependent function that upper bounds $\bar U_n$ with a high probability.
$\tilde U_n$ is a distribution-dependent function that provides an upper bound
on $\hat U_n$ (again, with a high probability). We now construct $\bar V_n,\hat V_n,\tilde V_n$ from
$\bar U_n,\hat U_n,\tilde U_n$ the same way as we have constructed $V_n$ from $U_n$ and define
$\hat\delta_n(t)$ and $\tilde\delta_n(t)$ in the same way as $\bar\delta_n(t)$.

We will prove the following theorem. 
Theorem 3. For all $t>0$,

$$P\{\bar\delta_n(t)\le\hat\delta_n(t)\le\tilde\delta_n(t)\}\ge1-\Big(\log_q\frac{q^2n}{t}+4\log_q\frac{qn}{t}\Big)\exp\{-t\}.$$

In many situations, $\bar\delta_n(t)$ and $\tilde\delta_n(t)$ are asymptotically within a constant
of one another as $n\to\infty$. The above theorem suggests that $\hat\delta_n(t)$ can be
used as an estimate (up to a constant) of $\bar\delta_n(t)$ and this allows one to use
this quantity as a data-dependent penalty in a model selection setting.

4. Toward sharper inequalities for excess risk. Suppose that the risk
minimization problem (1.1) has multiple solutions. This is a possibility, for
instance, in risk minimization with nonconvex loss functions. Also, in a model
selection framework (see Section 5) one deals with a family of risk
minimization problems over classes $\mathcal{F}_k\subset\mathcal{F}$ that approximate problem (1.1). It
is possible then that the global minimum of risk over the class $\mathcal{F}$ is attained
at a number of different competing classes (models) $\mathcal{F}_k$. In any case, the
multiple minima case has to be understood as a part of a comprehensive theory of
empirical risk minimization. In such cases, the diameter $D(\delta) = D_P(\mathcal{F};\delta)$ of
the $\delta$-minimal set does not tend to 0 as $\delta\to0$, and it is easy to see that the
quantity $\delta_n(t)$ defined in the previous section is going to be at least as large
as a quantity of the order $n^{-1/2}$. As a result, the bounds we have proved so far
are not necessarily optimal. The question is whether it is possible to replace
the diameter $D(\delta)$ by a more sophisticated geometric characteristic that would
allow us to construct tighter bounds on the excess risk. We explore in this
section one possible approach to this problem. Namely, we define the following
quantity:

$$r(\sigma;\delta) := \sup_{f\in\mathcal{F}(\delta)}\inf_{g\in\mathcal{F}(\sigma)}\rho_P(f,g),\qquad 0<\sigma\le\delta,$$

that characterizes the accuracy of approximation of the functions from the
$\delta$-minimal set by the functions from the $\sigma$-minimal set for two different
levels $\delta$ and $\sigma$. If $\mathcal{F}(0)\ne\emptyset$ (i.e., the minimum of $Pf$ is attained on $\mathcal{F}$), $r$ is
also well defined for $\sigma = 0$, $\delta\ge\sigma$.

The function $r(\sigma,\delta)$ is nondecreasing in $\delta$, nonincreasing in $\sigma$ and
$r(\delta,\delta) = 0$. If we extend $r$ to $\sigma>\delta$ by setting $r(\sigma;\delta) := r(\delta;\sigma)$, then, using
the triangle inequality for $\rho_P$, it is easy to check that $r$ is a pseudometric.
Clearly, $r(\sigma,\delta)\le D(\delta)$. Moreover, it is not hard to imagine situations in which
$r(0;\delta)$ is significantly smaller than $D(\delta)$ [say, $r(0;\delta)\to0$ as $\delta\to0$ whereas
$D(\delta)$ is bounded away from 0]. Suppose, for instance, that $\mathcal{F} := \bigcup_j\mathcal{F}_j$,
where $\mathcal{F}_j$ are classes of functions such that $\forall k,j$: $\min_{\mathcal{F}_j}Pf = \min_{\mathcal{F}_k}Pf$
(we assume that the minima are attained). Then it is easy to check that
$r(0;\delta)\le\sup_jD_P(\mathcal{F}_j;\delta)$. Of course, one can come up with examples of this
sort in which $r(0,\delta)\to0$ as $\delta\to0$, but $D(\delta)$ is bounded away from 0.
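
A toy computation makes the gap between $r(0;\delta)$ and $D(\delta)$ concrete. The following sketch uses a hypothetical four-element class on a four-point space with the uniform distribution and $\rho_P(f,g) = (P(f-g)^2)^{1/2}$; none of these choices comes from the text.

```python
import math

def risk(f):
    """Pf under the uniform distribution on a finite set S."""
    return sum(f) / len(f)

def rho(f, g):
    """L2(P) distance (P(f - g)^2)^{1/2}."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / len(f))

def minimal_set(F, delta):
    """Functions whose excess risk over F is at most delta."""
    best = min(risk(f) for f in F)
    return [f for f in F if risk(f) - best <= delta]

def diameter(F, delta):
    """D(delta): the rho-diameter of the delta-minimal set."""
    M = minimal_set(F, delta)
    return max(rho(f, g) for f in M for g in M)

def r(F, sigma, delta):
    """r(sigma; delta) = sup over F(delta) of inf over F(sigma) of rho."""
    inner = minimal_set(F, sigma)
    return max(min(rho(f, g) for g in inner) for f in minimal_set(F, delta))

# Two distinct risk minimizers and one near-minimizer close to each of them.
F = [
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (1.0, 0.0, 0.25, 0.0),
    (0.0, 1.0, 0.0, 0.25),
]
r_small = r(F, 0.0, 0.0625)
d_small = diameter(F, 0.0625)
d_zero = diameter(F, 0.0)
```

In this example the two minimizers are far apart, so the diameter stays above 0.7 even at level 0, while every near-minimizer lies within 0.125 of some minimizer.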

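It is not completely unnatural to expect that the function $r$ satisfies a
condition of the following type:

(4.1) $r(0;c_1\delta)\le c_2\,r(0;\delta),\qquad\delta\in(0,1]$

for some constants $c_1,c_2<1$. Since $r(0;\delta)\le r(0;c_1\delta)+r(c_1\delta,\delta)$ and $r$ is
nonincreasing in its first argument, we get for all $\sigma\le c_1\delta$

$$(1-c_2)\,r(0;\delta)\le r(\sigma;\delta)\le r(0;\delta),$$

which means that the values of $r(\sigma;\delta)$ are within a constant of one another
for all $\sigma$ that are not too close to $\delta$ ($\sigma\le c_1\delta$).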
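Let

$$\phi_n(\sigma,\delta) := \lim_{\varepsilon\to0}E\sup_{g\in\mathcal{F}(\sigma)}\ \sup_{f\in\mathcal{F}(\delta),\ \rho_P(f,g)\le r(\sigma,\delta)+\varepsilon}|(P_n-P)(f-g)|$$

and

$$U_n(\sigma;\delta;t) := \phi_n(\sigma,\delta)+\sqrt{2\frac{t}{n}\big(r^2(\sigma,\delta)+2\phi_n(\sigma,\delta)\big)}+\frac{t}{2n}.$$

Almost as before, we will need

$$V_n(\sigma;\delta;t) := \big(U_n(\sigma;\cdot\,;t)+\sigma\big)^{\flat,q}(\delta).$$

Finally, we define $\delta_n(\sigma;t) := \inf\{\delta : V_n(\sigma;\delta;t)\le1/2q\}$. Clearly, $\delta_n(\sigma;t)$ is the
$(\sharp,q)$-transform of the function $\delta\mapsto U_n(\sigma;\delta;t)+\sigma$ computed at the point $1/2q$.
We obtain the following version of Theorem 1.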

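Theorem 4. For all $\sigma\in(0,1]$, all $t>0$ and all $\delta\ge\delta_n(\sigma;t)$,

$$P\{\mathcal{E}(\hat f_n)\ge\delta\}\le\log_q\frac{q}{\delta}\exp\{-t\}$$

and

$$P\Big\{\exists f\in\mathcal{F}:\ \mathcal{E}(f)\ge\delta\ \text{and}\ \frac{\hat{\mathcal{E}}_n(f)}{\mathcal{E}(f)}\le1-qV_n(\sigma;\delta;t)\Big\}\le\log_q\frac{q}{\delta}\exp\{-t\}.$$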

Note that, unlike the inequalities of Theorem 1, we have here only a
one-sided bound for the ratio $\hat{\mathcal{E}}_n(f)/\mathcal{E}(f)$. As a result, it is easy to show that,
for all $\sigma\in(0,1]$ and all $t > 0$, there exists an event of probability at least
$1-\log_q\frac{q}{\delta_n(\sigma;t)}e^{-t}$ such that on this event $\forall\delta\ge\delta_n(\sigma,t)$ the inclusion
$\hat{\mathcal{F}}_n(\delta)\subset\mathcal{F}(2\delta)$ holds, but not the other inclusion of Lemma 2. The following
proposition shows that this difficulty is unavoidable and the set $\hat{\mathcal{F}}_n(\delta)$ does not
include even $\mathcal{F}(0)$ for the values of $\delta$ of the order $\delta_n(\sigma;t)$, or even larger.
For this reason, the estimation of the quantity $r(\sigma;\delta)$ based on the data
is a much harder problem than the estimation of the diameter $D_P(\mathcal{F};\delta)$.
The discussion of this problem goes beyond the scope of the paper.

Proposition 2. Let $S := \{0,1\}^{N+1}$ and let $P$ be the uniform distribution
on $\{0,1\}^{N+1}$. Let $\mathcal{F} := \{f_j : 1\le j\le N+1\}$, where $f_j(x) = x_j$,
$x = (x_1,\dots,x_{N+1})\in\{0,1\}^{N+1}$. Then the following statements hold:

(i) $\mathcal{E}_P(f) = 0$, $f\in\mathcal{F}$;

(ii) with some $C > 0$, $\delta_n(\sigma;t)\le Ct/n$;

(iii) with some $c>0$, $\delta_n(t)\ge c\big((\log N/n)^{1/2}+(t/n)^{1/2}\big)$;

(iv) for any $\varepsilon>0$ there exists $N_0$ such that, for $N_0\le N\le\sqrt{n}$ and for
$\delta = 0.25(\log N/n)^{1/2}$, the inclusion $\mathcal{F}(0)\subset\hat{\mathcal{F}}_n(\delta)$ does not hold with probability
at least $1-\varepsilon$.



5. Model selection. Consider a family of function classes $\{\mathcal{F}_k\}$ such that
$\forall k$, $\mathcal{F}_k\subset\mathcal{F}$. In applications, the classes $\{\mathcal{F}_k\}$ are used to find an
approximate solution of the risk minimization problem on the bigger class $\mathcal{F}$ of
functions of interest. Let $\hat f_k := \hat f_{n,k} := \operatorname{argmin}_{f\in\mathcal{F}_k}P_nf$ be the corresponding
empirical risk minimizers (we assume for simplicity that they exist). The goal is
to construct, based on $\{\hat f_{n,k}\}$, a function $\hat f\in\mathcal{F}$ for which the excess risk
$\mathcal{E}_P(\mathcal{F};\hat f)$ is small. To formulate the problem more precisely, suppose that there
exists an index $k(P)$ such that $\inf_{\mathcal{F}_{k(P)}}Pf = \inf_{\mathcal{F}}Pf$, that is, a risk minimizer
over the large class $\mathcal{F}$ can be found in a smaller class $\mathcal{F}_{k(P)}$. Let $\delta_n(k)$ be
an upper bound on the excess risk (with respect to the class $\mathcal{F}_k$) of $\hat f_{n,k}$
that provides the optimal (in a minimax sense), or just a desirable, accuracy
of the solution of the empirical risk minimization problem on the class $\mathcal{F}_k$. If
there were an oracle who could tell a statistician that $k(P) = k$ is the right
index of the class to be used, then the risk minimization problem could be
solved with the accuracy $\delta_n(k)$. The model selection problem deals with
constructing a data-dependent index $\hat k = \hat k(X_1,\dots,X_n)$ of the model such that
the excess risk of $\hat f := \hat f_{n,\hat k}$ is within a constant from $\delta_n(k(P))$ with a high
probability. More generally, in the case when the global minimum over $\mathcal{F}$ is
not attained precisely in any of the classes $\mathcal{F}_k$, one can still hope to show
that with a high probability

$$\mathcal{E}_P(\mathcal{F};\hat f)\le C\inf_k\Big[\inf_{\mathcal{F}_k}Pf-Pf_*+\pi_n(k)\Big],$$

where $f_* := \operatorname{argmin}_{f\in\mathcal{F}}Pf$ (its existence will be assumed in what follows),
$\pi_n(k)$ are "ideal" distribution-dependent complexity penalties associated
with risk minimization over $\mathcal{F}_k$ and $C$ is a constant (preferably, $C = 1$ or
at least close to 1). The inequalities that express such a property are often
referred to as oracle inequalities.

Among the most popular approaches to model selection are penalization
methods, in which $\hat k$ is defined as a solution of the following minimization
problem:

(5.1) $\hat k := \operatorname{argmin}_{k\ge1}\{P_n\hat f_k+\hat\pi(k)\},$

where $\hat\pi(k)$ is a complexity penalty (generally, data dependent) associated
with the class (the model) $\mathcal{F}_k$. In other words, instead of minimizing the
empirical risk on the whole class $\mathcal{F}$ we now minimize a penalized empirical risk.
We discuss below two penalization methods (one in the spirit of [34], another
more in the spirit of [36]) with the penalties based on data-dependent bounds
on excess risk developed in previous sections. Penalization methods proved
to be very useful in a variety of statistical problems, including nonparametric
regression. However, there are substantial difficulties in implementing
model selection techniques based on penalization in nonparametric
classification problems. To the best of our knowledge, this approach has failed
so far to produce adaptive classification rules with fast Tsybakov-type
convergence rates (an exception is the recent result by [45] that achieves this
goal, but only in a very special and somewhat artificial framework). As an
alternative, we discuss a general model selection technique based on comparing the
minima of empirical risk for different models with certain data-dependent
thresholds (defined in terms of excess risk confidence bounds of the
previous sections) that allows one to recover Tsybakov's convergence rates in
very general risk minimization problems, including classification (note that
Tsybakov [44] also used a version of the comparison method in a specialized
classification framework).
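
The penalization rule (5.1) reduces to a one-line minimization once the minimal empirical risks and the penalties are known. The sketch below is illustrative only: the function name and all numbers are hypothetical, and the penalties are taken as given inputs rather than computed from data.

```python
def select_by_penalization(empirical_risks, penalties):
    """Return (k_hat, score), with k_hat the 1-based minimizer of P_n f_k + pi(k)."""
    if len(empirical_risks) != len(penalties) or not empirical_risks:
        raise ValueError("need two non-empty lists of equal length")
    scores = [risk + pen for risk, pen in zip(empirical_risks, penalties)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores[best]

# Hypothetical values: richer models fit better but pay a larger penalty.
empirical_risks = [0.30, 0.22, 0.20, 0.19]
penalties = [0.01, 0.03, 0.08, 0.15]
k_hat, penalized_risk = select_by_penalization(empirical_risks, penalties)
```

With these numbers the second model is selected: the improvement in empirical risk offered by the third and fourth models does not cover the increase in their penalties.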

To provide some motivation for the approaches discussed below, note
that ideally one would want to find $\hat k$ by minimizing over $k$ the global excess
risk $\mathcal{E}_P(\mathcal{F};\hat f_{n,k})$ of the solutions. This is impossible without an oracle's help,
so one has to develop some data-dependent upper confidence bounds on
the excess risk. The following trivial representation (that plays the role of
a "bias-variance decomposition")

$$\mathcal{E}_P(\mathcal{F};\hat f_{n,k}) = \inf_{\mathcal{F}_k}Pf-Pf_*+\mathcal{E}_P(\mathcal{F}_k;\hat f_{n,k})$$

shows that part of the problem is to come up with data-dependent upper
bounds on the local excess risk $\mathcal{E}_P(\mathcal{F}_k;\hat f_{n,k})$, which is precisely what we
considered in the previous sections. Another part is to bound $\inf_{\mathcal{F}_k}Pf-Pf_*$
in terms of $\inf_{\mathcal{F}_k}P_nf-P_nf_*$, which is what we do in Lemma 4 below.
Combining these two bounds provides an upper bound on the global excess
risk that can now be minimized with respect to $k$ ($P_nf_*$ can be dropped since
it does not depend on $k$). Another approach is to use the representation

$$\mathcal{E}_P(\mathcal{F};\hat f_{n,k})-\mathcal{E}_P(\mathcal{F};\hat f_{n,l}) = \inf_{\mathcal{F}_k}Pf-\inf_{\mathcal{F}_l}Pf+\mathcal{E}_P(\mathcal{F}_k;\hat f_{n,k})-\mathcal{E}_P(\mathcal{F}_l;\hat f_{n,l})$$

and data-dependent bounds on local excess risk to develop a model selection
technique based on comparison of the difference between $\inf_{\mathcal{F}_k}P_nf$ and
$\inf_{\mathcal{F}_l}P_nf$ with certain data-dependent thresholds (which is done in
Section 5.3 below).

For $\mathcal{G}\subset\mathcal{F}$, the distribution-dependent complexity $\bar\delta_n(\mathcal{G};t)$ is defined as in
Section 3 [$\bar\delta_n(t) = \bar U_{n,t}^{\sharp,q}(1/2q^3)$]. Let $t_k>0$ and let $\hat\delta_n(\mathcal{F}_k;t_k)$ and $\tilde\delta_n(\mathcal{F}_k;t_k)$
be, respectively, data-dependent and distribution-dependent complexities
such that

(5.2) $\forall k\quad P\{\bar\delta_n(\mathcal{F}_k;t_k)\le\hat\delta_n(\mathcal{F}_k;t_k)\le\tilde\delta_n(\mathcal{F}_k;t_k)\}\ge1-p_k.$

In particular, one can use the version of these complexities constructed in
Section 3, in which case $p_k := \log_q\frac{q^2n}{t_k}e^{-t_k}+4\log_q\frac{qn}{t_k}e^{-t_k}$, by Theorem 3. We
use these notations throughout the section.



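5.1. Penalization method: version 1. Define the following penalties:

$$\hat\pi(k) := \hat K\Big[\hat\delta_n(\mathcal{F}_k;t_k)+\sqrt{\frac{t_k}{n}\inf_{\mathcal{F}_k}P_nf}+\frac{t_k}{n}\Big]$$

and

$$\tilde\pi(k) := \tilde K\Big[\tilde\delta_n(\mathcal{F}_k;t_k)+\sqrt{\frac{t_k}{n}\inf_{\mathcal{F}_k}Pf}+\frac{t_k}{n}\Big],$$

where $\hat K,\tilde K$ are sufficiently large numerical constants. Here $\tilde\pi(k)$ represents
a "desirable accuracy" of risk minimization on the class $\mathcal{F}_k$. The index
estimate $\hat k$ is defined according to the standard penalization method (5.1) and we
set $\hat f := \hat f_{n,\hat k}$.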

Theorem 5. There exists a choice of $\hat K,\tilde K$ such that for any sequence
$\{t_k\}$ of positive numbers,

$$P\Big\{P\hat f\ge\inf_{k\ge1}\{P_n\hat f_{n,k}+\hat\pi(k)\}\Big\}\le\sum_{k=1}^{\infty}\Big(p_k+\log_q\frac{q^2n}{t_k}e^{-t_k}\Big)$$

and

$$P\Big\{\mathcal{E}_P(\mathcal{F};\hat f)\ge\inf_{k\ge1}\Big\{\inf_{\mathcal{F}_k}Pf-\inf_{\mathcal{F}}Pf+\tilde\pi(k)\Big\}\Big\}\le\sum_{k=1}^{\infty}\Big(p_k+\log_q\frac{q^2n}{t_k}e^{-t_k}\Big).$$

The first bound of the theorem is an upper confidence bound on the risk
of $\hat f$ in terms of the minimal penalized empirical risk. The second bound is an
oracle inequality showing that the excess risk of the function $\hat f$ is nearly
optimal (up to complexity penalty terms).

The proof relies on the following lemma, which might be of independent 
interest. 

Lemma 3. Given a class $\mathcal{F}$ of measurable functions from $S$ into $[0,1]$,
suppose that, for some $t>0$ and $p\in(0,1)$, $P\{\bar\delta_n(\mathcal{F};t)\le\hat\delta_n(\mathcal{F};t)\}\ge1-p$.
Then the following inequalities hold:



MP n f - inf Pf > 25 n (T; t) + J-MPf + -\< log, Jj-e' 1 



and 



It Ht\ rfi 

MP n f - inf Pf > 45 n (f; t) + 2 J -MP n f + -\<p + log. 



T r y n r n J q 5 n (t) 

5.2. Penalization method: version 2. For this version of the penalization
method, the following assumption is crucial:

(5.3) $\forall f\in\mathcal{F}\quad Pf-Pf_*\ge\varphi\big(\sqrt{\operatorname{Var}_P(f-f_*)}\big),$

where $\varphi$ is a convex nondecreasing function on $[0,+\infty)$ with $\varphi(0) = 0$. We
also assume that $\varphi(uv)\le\varphi(u)\varphi(v)$, $u,v\ge0$. The function $\varphi$ is supposed to
be known and is involved in the definition of the penalties. This is the case,
for instance, in least squares regression where one can use $\varphi(u) = u^2/2$ (see
Section 8). However, in classification problems $\varphi$ is typically unknown, but
it has a significant impact on the convergence rates. Adapting to an unknown
function $\varphi$ is a challenge for model selection in the classification setting.

Denote by $\varphi^*(v) := \sup_{u\ge0}[uv-\varphi(u)]$ the conjugate of $\varphi$. We have
$uv\le\varphi(u)+\varphi^*(v)$, $u,v\ge0$. For a fixed $\varepsilon>0$, define the penalties as follows:

Tr{k) := A(e)5 n (T k ;t k ) + p* 
and



7T(fc) := - A{£ ) ~5 n {Fk;t k ) + 



1 + <p(V£) 



! + ¥>(>/£) 



+ 



(l + ¥>(>/i)) n 



where A(e) :- 



$\varphi(\sqrt{\varepsilon})$. As before, $\hat k$ is defined by (5.1) and $\hat f := \hat f_{n,\hat k}$.
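
The conjugate $\varphi^*$ that enters these penalties can be evaluated numerically when no closed form is at hand. The sketch below is an illustration under stated assumptions: it uses the least squares choice $\varphi(u) = u^2/2$ mentioned above, for which $\varphi^*(v) = v^2/2$, and approximates the supremum on a finite grid whose range and resolution are arbitrary choices.

```python
def conjugate(phi, v, u_max=10.0, steps=10_000):
    """Grid approximation of phi*(v) = sup_{u >= 0} [u v - phi(u)]."""
    h = u_max / steps
    return max(i * h * v - phi(i * h) for i in range(steps + 1))

def phi_least_squares(u):
    """phi(u) = u^2 / 2, the choice quoted for least squares regression."""
    return u * u / 2.0

# Compare the grid approximation with the closed form v^2 / 2.
conjugate_values = {v: conjugate(phi_least_squares, v) for v in (0.0, 0.5, 1.0, 2.0)}
```

The grid value is always a lower approximation of the supremum, so it should only be trusted when the maximizing $u$ lies well inside $[0, u_{\max}]$.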



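Theorem 6. For any sequence $\{t_k\}$ of positive numbers,

$$P\Big\{\mathcal{E}_P(\mathcal{F};\hat f)\ge C(\varepsilon)\inf_{k\ge1}\Big\{\inf_{\mathcal{F}_k}Pf-\inf_{\mathcal{F}}Pf+\tilde\pi(k)\Big\}\Big\}
\le\sum_{k=1}^{\infty}\Big(p_k+2\log_q\frac{q^2n}{t_k}e^{-t_k}\Big),$$

where $C(\varepsilon) := \frac{1+\varphi(\sqrt{\varepsilon})}{1-\varphi(\sqrt{\varepsilon})}$.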

The following lemma is used in the proof. 

Lemma 4. Let $\mathcal{G}\subset\mathcal{F}$. For all $t > 0$, there exists an event $E$ with
probability at least $1-\log_q\frac{q^2n}{t}e^{-t}$ such that on this event



(5.4) inf P n f - P n f, < (1 + <p(y/I))[MPf - Pf.) + <p 



and 



(5.5) 



mfPf-Pf*<{l-y{^))- 1 
y 



miP n f-P n f, + -S n (g;t) + tp* 
G Z 



In addition, if there exists 5 n (Q;e;t) such that 



then 



(5.6) 



— I + - 

n 



S n (G; t) < e(miPf - Pf*) +5 n (G;e;t) 



1 - <p(y/e) - 



mtP n f-P n U + -8 n {g-e-t)+tp* 
Q I 



— I + - 

n 



+ 



n 



Remarks. 1. It is easily seen from the proofs that the same inequality
holds for arbitrary penalties $\hat\pi(k)$ and $\tilde\pi(k)$ such that with probability at
least $1-p_k$



7r(k)>A(e)S n (T k ;t k ) + ip*\ 



+ 



n 
and



m>_ *(*> + ^ ( v%> + 



l + <p(VZ) l + <p(y/e) (l + ^(Ve))n' 

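2. Suppose that the following condition holds:

$$\bar\delta_n(\mathcal{F}_k;t)\le\varepsilon\Big(\inf_{\mathcal{F}_k}Pf-Pf_*\Big)+\bar\delta_n(\mathcal{F}_k;\varepsilon;t),$$

as is the case in Lemma 5 below. Suppose also that there exist $\hat\delta_n(\mathcal{F}_k;\varepsilon;t_k)$,
$\tilde\delta_n(\mathcal{F}_k;\varepsilon;t_k)$ such that

$$\forall k\quad P\{\bar\delta_n(\mathcal{F}_k;\varepsilon;t_k)\le\hat\delta_n(\mathcal{F}_k;\varepsilon;t_k)\le\tilde\delta_n(\mathcal{F}_k;\varepsilon;t_k)\}\ge1-p_k.$$

Then, using the bound (5.6) of Lemma 4, one can easily modify Theorem 6,
replacing in the definition of the penalties the quantities $\bar\delta_n(\mathcal{F}_k;t_k)$,
$\hat\delta_n(\mathcal{F}_k;t_k)$, $\tilde\delta_n(\mathcal{F}_k;t_k)$ by $\bar\delta_n(\mathcal{F}_k;\varepsilon;t_k)$, $\hat\delta_n(\mathcal{F}_k;\varepsilon;t_k)$, $\tilde\delta_n(\mathcal{F}_k;\varepsilon;t_k)$ and also
defining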

A(e) := | + (1 - ip(y/e) - |e)/(l + e) and 
(7(e) := (1 + <p(y/e))(l + e)/U " v(v'i) - 

3. Note also that if $\hat\delta_n(\mathcal{F}_k;t_k)$ is replaced by $\bar\delta_n(\mathcal{F}_k;t_k)$, defined as in
Theorem 2, the result of Theorem 6 is also true, and, moreover, the logarithmic
factor in the oracle inequality can be dropped: the expression on the
right-hand side of the bound of Theorem 6 becomes $\sum_{k=1}^{\infty}(p_k+4e^{-t_k})$.

4. The result also holds if condition (5.3) holds for each $k$ and for all
$f\in\mathcal{F}_k$ with its own function $\varphi_k$ (but with the same function $f_*$) and the
sequence of functions $\{\varphi_k\}$ is nonincreasing: $\forall k$ $\varphi_k\ge\varphi_{k+1}$. In this case, one
should use the function $\varphi_k$ in the definitions of $\hat\pi(k),\tilde\pi(k)$. $C(\varepsilon)$ is defined
as before with $\varphi = \varphi_1$.

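5.3. Comparison method. The version of the comparison method presented
here relies on the following assumption: $\mathcal{F}_1\subset\mathcal{F}_2\subset\cdots$. Denote

$$\bar\delta_n(k) := \max_{1\le j\le k}\bar\delta_n(\mathcal{F}_j;t_j),\qquad
\hat\delta_n(k) := \max_{1\le j\le k}\hat\delta_n(\mathcal{F}_j;t_j),\qquad
\tilde\delta_n(k) := \max_{1\le j\le k}\tilde\delta_n(\mathcal{F}_j;t_j)$$

and define with some numerical constants $\bar c,\hat c,\tilde c$ and with inf being $\infty$ if the
set of $k$'s is empty:

$$k^* := k^*(P) := \inf\Big\{k:\forall l>k\ \inf_{\mathcal{F}_k}Pf = \inf_{\mathcal{F}_l}Pf\Big\},$$

$$\bar k := \bar k(P) := \inf\Big\{k:\forall l>k\ \inf_{\mathcal{F}_k}Pf-\inf_{\mathcal{F}_l}Pf\le\bar c\,\bar\delta_n(l)\Big\},$$

$$\hat k := \inf\Big\{k:\forall l>k\ \inf_{\mathcal{F}_k}P_nf-\inf_{\mathcal{F}_l}P_nf\le\hat c\,\hat\delta_n(l)\Big\},$$

$$\tilde k := \tilde k(P) := \inf\Big\{k:\forall l>k\ \inf_{\mathcal{F}_k}Pf-\inf_{\mathcal{F}_l}Pf\le\tilde c\,\tilde\delta_n(l)\Big\}.$$

Finally, let $\hat f := \hat f_{n,\hat k}$ (if $\hat k = \infty$, $\hat f$ can be defined in an arbitrary way, say,
$\hat f = \hat f_{n,1}$).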
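
The definition of $\hat k$ translates directly into a scan over a finite nested family. The sketch below is hypothetical in its names, constants and numbers; it takes the minimal empirical risks and the per-class data-dependent complexities as given, forms the running maxima $\hat\delta_n(l)$, and returns the first index that passes every comparison.

```python
def running_max(values):
    """delta_hat_n(k) = max over j <= k of the per-class complexities."""
    out, current = [], float("-inf")
    for v in values:
        current = max(current, v)
        out.append(current)
    return out

def select_by_comparison(min_emp_risks, complexities, c_hat=1.0):
    """Smallest 1-based k such that, for all l > k,
    inf_{F_k} P_n f - inf_{F_l} P_n f <= c_hat * delta_hat_n(l)."""
    thresholds = running_max(complexities)
    n_models = len(min_emp_risks)
    for k in range(n_models):
        if all(
            min_emp_risks[k] - min_emp_risks[l] <= c_hat * thresholds[l]
            for l in range(k + 1, n_models)
        ):
            return k + 1
    return None  # only reachable for an empty family

# Hypothetical nested family: minimal empirical risks decrease with k.
min_emp_risks = [0.30, 0.21, 0.20, 0.195]
complexities = [0.01, 0.02, 0.05, 0.09]
k_hat = select_by_comparison(min_emp_risks, complexities)
```

For a finite family the last index always passes, since there is no larger model to compare against; the rule therefore stops at the first model whose empirical risk cannot be improved by more than the threshold of any richer model.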

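Theorem 7. There exists a choice of constants $\bar c,\hat c,\tilde c$ such that with
some constant $C>0$ for any sequence $\{t_k\}$ of positive numbers,

$$P\Big\{P\hat f-\inf_k\inf_{\mathcal{F}_k}Pf\ge\inf_{k\ge\tilde k(P)}\Big[\inf_{\mathcal{F}_k}Pf-\inf_l\inf_{\mathcal{F}_l}Pf+C\tilde\delta_n(k)\Big]\Big\}
\le\sum_{k=1}^{\infty}\Big(p_k+\log_q\frac{q^2n}{t_k}e^{-t_k}\Big).$$

In particular, if $k^*(P)<\infty$, then

$$P\Big\{P\hat f-\inf_k\inf_{\mathcal{F}_k}Pf\ge C\tilde\delta_n(k^*(P))\Big\}\le\sum_{k=1}^{\infty}\Big(p_k+\log_q\frac{q^2n}{t_k}e^{-t_k}\Big).$$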

Remarks. 1. If $\tilde k(P) = \infty$, assume that the infimum over $k\ge\tilde k(P)$ is
equal to 1, which makes the first bound trivial. If $\tilde k(P)<\infty$, it follows from
the proof that so is $\hat k$ (with the exception of the event whose probability is
controlled in the theorem).

2. If $\hat\delta_n(\mathcal{F}_k;t_k)$ is replaced by $\bar\delta_n(\mathcal{F}_k;t_k)$ (as defined in Theorem 2), then the
logarithmic factor in the oracle inequality can be dropped and the expression
on the right-hand side of the bounds becomes $\sum_{k=1}^{\infty}(p_k+2e^{-t_k})$.

6. Connection to several recent results. In this section, we discuss the 
connection of our main results to some other recent work on model selection 
in risk minimization problems, including [34, 36, 44]. 

6.1. Tsybakov. Our first example is motivated by the recent work of
Tsybakov [44] (see also the earlier paper by Mammen and Tsybakov [35])
on fast convergence rates in classification. Let $\rho_P^2(f,g) := P(f-g)^2$. Define
the expected continuity modulus $\omega_n(\mathcal{F};\delta)$ as in Section 3. For $\rho\in(0,1)$,
$\kappa\ge1$ and $C > 0$, let $\mathcal{P}_{\rho,\kappa,C}(\mathcal{F})$ denote the class of probability measures $P$
such that the following two conditions hold:

(i) $\omega_n(\mathcal{F};\delta)\le C\delta^{1-\rho}n^{-1/2}$;

(ii) $D_P(\mathcal{F};\delta)\le C\delta^{1/(2\kappa)}$.

Theorem 8. Under conditions (i) and (ii),
$\sup_{P\in\mathcal{P}_{\rho,\kappa,C}(\mathcal{F})}E\,\mathcal{E}_P(\mathcal{F};\hat f_n) = O\big(n^{-\kappa/(2\kappa+\rho-1)}\big)$.
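
The exponent in this rate can be read off from conditions (i) and (ii) by a heuristic fixed point calculation. Combining $\phi_n(\delta)\le\omega_n(\mathcal{F};D(\delta))$ with (i) and (ii) gives $\phi_n(\delta)\le C^{2-\rho}\delta^{(1-\rho)/(2\kappa)}n^{-1/2}$, and balancing this against $\delta$ yields $\delta\asymp n^{-\kappa/(2\kappa+\rho-1)}$. The sketch below checks this numerically; it keeps only the leading term and ignores the $t$-dependent terms of $U_n$, so it is an illustration and not the proof, and the function names are hypothetical.

```python
import math

def rate_exponent(rho, kappa):
    """Exponent beta in the rate n^{-beta}, beta = kappa / (2 kappa + rho - 1)."""
    return kappa / (2.0 * kappa + rho - 1.0)

def leading_fixed_point(n, rho, kappa, C=1.0, iterations=200):
    """Iterate delta <- C^(2 - rho) * delta^((1 - rho) / (2 kappa)) / sqrt(n)."""
    a = (1.0 - rho) / (2.0 * kappa)  # a < 1/2, so the iteration contracts in log scale
    delta = 1.0
    for _ in range(iterations):
        delta = C ** (2.0 - rho) * delta ** a / math.sqrt(n)
    return delta

n = 10 ** 6
fixed_point = leading_fixed_point(n, rho=0.5, kappa=1.0)
beta = rate_exponent(rho=0.5, kappa=1.0)
```

For $\kappa = 1$ and $\rho = 1/2$ the exponent is $2/3$, which is faster than the rate $n^{-1/2}$ discussed in Section 3.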

This result generalizes Theorem 1 in [44]. Namely, using the standard
Dudley's entropy integral bound on the expected continuity modulus of the
empirical process under the condition that the $L_2(P)$-entropy with bracketing
of the class $\mathcal{F}$ grows as $O(\varepsilon^{-2\rho})$ (see, e.g., [47], Theorem 2.14.2) yields
condition (i). If

(6.1) $f_* := f_{*,\mathcal{F}} := \operatorname{argmin}_{f\in\mathcal{F}}Pf\quad\text{and}\quad Pf-Pf_*\ge c_0\rho_P^{2\kappa}(f,f_*),\ f\in\mathcal{F},$

then also condition (ii) is satisfied. The conditions above, being translated
to the case of classes of sets (which was the case considered by Tsybakov
whose paper dealt with the binary classification problem), are precisely the
assumptions (A1) and (A2) in Tsybakov [44] and the rate of convergence
$O\big(n^{-\kappa/(2\kappa+\rho-1)}\big)$ is the one obtained by Tsybakov. Of course, condition (i) will
also be satisfied under many other assumptions common in empirical processes
theory; for example, it can be expressed in terms of random entropies of the
class. Also, the diameter $D_P(\mathcal{F};\delta)$ in condition (ii) can be replaced by the more
subtle geometric characteristic $r(0;\delta) = r_P(\mathcal{F};0,\delta)$ defined in Section 4. In
other words, condition (6.1) can be replaced by the following:

(6.2) $\forall f\in\mathcal{F}\ \exists f_*\in\operatorname{argmin}_{f\in\mathcal{F}}Pf = \mathcal{F}(0):\quad Pf-Pf_*\ge c_0\rho_P^{2\kappa}(f,f_*),$

including the case when the risk $Pf$ has multiple minima on $\mathcal{F}$. Theorem 8
holds in this case with only minor changes in the proof.
Next we turn to model selection.
Next we turn to model selection. 

Theorem 9. Consider a family $\{(\mathcal{F}_j,\mathcal{P}_j)\}_{1\le j\le N}$ such that $\mathcal{F}_j\subset\mathcal{F}$,
$\mathcal{P}_j := \mathcal{P}_{\rho_j,\kappa_j,C}(\mathcal{F}_j)$ and for all $P\in\mathcal{P}_j$ we have $f_{*,P}\in\mathcal{F}_j$. Moreover, assume
that $\mathcal{F}_1\subset\mathcal{F}_2\subset\cdots\subset\mathcal{F}_N$, that for all $P\in\mathcal{P}_j$, $k^*(P) = j$ (with $k^*(P)$
defined in Section 5.3) and that the numbers $\beta_j := \kappa_j/(2\kappa_j+\rho_j-1)$ satisfy
the condition $\beta_1\ge\beta_2\ge\cdots\ge\beta_N$. Define $\hat k$ and $\hat f$ as in Theorem 7 (with
$t_k := \log N+3\log n$, $k = 1,\dots,N$). Then

$$\max_{1\le j\le N}\sup_{P\in\mathcal{P}_j}n^{\beta_j}E(P\hat f-Pf_*) = O(1)\quad\text{as } n\to\infty.$$

Note that the result is also true if $N = N_n$, where $N_n$ grows not too fast,
say, so that for all $\delta > 0$, $\log N_n = o(n^{\delta})$ as $n\to\infty$. This should be compared
with Theorem 3 in [44] where another method of constructing an adaptive
empirical risk minimizer was suggested in a more special classification
framework and it was proved that the optimal convergence rate is attained at this
estimate up to a logarithmic factor. Our Theorem 9 extends these types of
results to a more general framework of abstract empirical risk minimization
and refines them by removing the logarithmic factor.

6.2. Lugosi and Wegkamp. Next we turn to the results of a recent paper
of Lugosi and Wegkamp [34]. Suppose that $\mathcal{F}$ is a class of measurable
functions on $S$ taking values in $\{0,1\}$ (binary functions). As in Section 2,
Example 6, $\Delta^{\mathcal{F}}(X_1,\dots,X_n)$ denotes the shattering number of the class $\mathcal{F}$
on the sample $(X_1,\dots,X_n)$.

Given a sequence $\{\mathcal{F}_k\}$, $\mathcal{F}_k\subset\mathcal{F}$, of classes of binary functions, define the
penalties

$$\hat\pi(k) := \hat K\Big[\sqrt{\inf_{\mathcal{F}_k}P_nf\ \frac{\log\Delta^{\mathcal{F}_k}(X_1,\dots,X_n)+t_k}{n}}+\frac{\log\Delta^{\mathcal{F}_k}(X_1,\dots,X_n)+t_k}{n}\Big]$$

and

$$\tilde\pi(k) := \tilde K\Big[\sqrt{\inf_{\mathcal{F}_k}Pf\ \frac{E\log\Delta^{\mathcal{F}_k}(X_1,\dots,X_n)+t_k}{n}}+\frac{E\log\Delta^{\mathcal{F}_k}(X_1,\dots,X_n)+t_k}{n}\Big]$$

and let $\hat k$ solve the penalized empirical risk minimization problem (5.1),
$\hat f := \hat f_{n,\hat k}$.




Theorem 10. There exists a choice of $\hat K,\tilde K$ such that for all $t_k>0$,



{€ f) > bf { hf Pf - inf Pf + m}}<*E ^ 



k=l 



The development of penalization techniques that lead to these types of
oracle inequalities was one of the major goals of the paper of Lugosi and
Wegkamp [34]. The slightly sharper results obtained in that paper (involving
the shattering numbers or Rademacher complexities of the classes $\hat{\mathcal{F}}_k(\hat\delta_k)$
for suitably chosen $\hat\delta_k$ instead of the global shattering numbers) can also be
recovered from Theorem 7 relatively easily (using Lemma 2).

6.3. Massart. We consider now some recent results of Massart [36] that
we formulate in a somewhat different form. Suppose that $\mathcal{F}$ is a class of
measurable functions from $S$ into $[0,1]$ and $f_* : S\mapsto[0,1]$ is a measurable
function such that with some numerical constant $D\ge1$

(6.3) $D(Pf-Pf_*)\ge\rho_P^2(f,f_*)\ge P(f-f_*)^2-(P(f-f_*))^2,$

where $\rho_P$ is a (pseudo)metric. We will assume, for simplicity, that the
infimum of $Pf$ over $\mathcal{F}$ is attained in $\mathcal{F}$ (the result can be easily
modified if this is not the case). Recall the definition of $\theta_n(\delta)$ in Section 2.
The following lemma will be crucial.

Lemma 5. There exists a large enough numerical constant $K>0$ such
that for all $\varepsilon\in(0,1]$ and for all $t>0$

U^t) < eUfPf - Pf,) + ±<&( 4-] + J 



t 



\r J J I D n \KD J e n 

It immediately follows from the lemma and Theorem 1 that 

*{pf-Pf. > (l + £ )(inf Pf-Pf.) + 1,» (^) + Ml } < log, 2Ur 

(and, due to Theorem 2, a version without the logarithmic factor holds with
$\theta_n$ replaced by an upper bound $\bar\theta_n$ of strictly concave type).

Now suppose that $\{\mathcal{F}_j\}$ is a sequence of function classes such that
condition (6.3) holds for each class $\mathcal{F}_j$ with some constant $D_j\ge1$ (and with the
same $f_*$). Assume also that the sequence $\{D_j\}$ is nondecreasing. We denote
$\bar\delta_n(\varepsilon;j) := D_j^{-1}\theta_{n,\mathcal{F}_j}^{\sharp}(\varepsilon/KD_j)$ and suppose that for any $j$ there exist a
data-dependent quantity $\hat\delta_n(\varepsilon;j)$ and a distribution-dependent quantity $\tilde\delta_n(\varepsilon;j)$
such that $\forall j$, $P\{\bar\delta_n(\varepsilon;j)\le\hat\delta_n(\varepsilon;j)\le\tilde\delta_n(\varepsilon;j)\}\ge1-p_j$. Now we define the
penalties as follows:

Tt{e\j):= SS n (e; j) + and 7r(e;j) :=3>5 n (e;j) + ^ 

en en 

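with some numerical constants $\hat K,\tilde K$. Define $\hat k$ according to (5.1), $\hat f := \hat f_{n,\hat k}$.
The next result follows from Lemma 5 and Theorem 6.

Theorem 11. There exist numerical constants $\hat K,\tilde K$ such that for any
sequence $\{t_k\}$ of positive numbers,

$$P\Big\{P\hat f-Pf_*\ge\frac{1+\varepsilon}{1-\varepsilon}\inf_{k\ge1}\Big\{\inf_{f\in\mathcal{F}_k}Pf-Pf_*+\tilde\pi(\varepsilon;k)\Big\}\Big\}
\le\sum_{k=1}^{\infty}\Big(p_k+2\log_q\frac{q^2n}{t_k}e^{-t_k}\Big).$$

If, in addition, $\forall j$, $\forall\delta>0$: $\theta_n(\mathcal{F}_j;\delta)\le\bar\theta_n(\mathcal{F}_j;\delta)$, where $\bar\theta_n(\mathcal{F}_j;\cdot) = \bar\theta_{n,\mathcal{F}_j}(\cdot)$ is
a function of strictly concave type, then one can replace $\bar\delta_n(\varepsilon;j)$ by
$D_j^{-1}\bar\theta_{n,\mathcal{F}_j}^{\sharp}(\varepsilon/KD_j)$, the right-hand side of the bound being in this case
$\sum_{k=1}^{\infty}(p_k+4e^{-t_k})$.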

This result has a number of applications. In a sense, most of the
important complexity penalties used in learning theory can be derived as its
consequence. For example (pointed out already in [36]), if $\mathcal{F}_k$ are classes of
binary functions and

_ 6logA^(X 1 ,...,X n ) + Kt k 

ir{K) .— , 

n 

one can use Theorem 11, the bounds of Example 6, Section 2 and the devia- 
tion inequalities for shattering numbers [12] to get very easily the following 
oracle inequality: 

p {p/ - p, > C g(M P f - P,. + E2i^M)±^ }} 

oo 

< 5 

fc=l 

with some constant C > 1. One can also combine Theorem 11 with Lemma 1 
to obtain oracle inequalities for penalization method based on localized 
Rademacher complexities (defined in terms of continuity modulus of Rademacher 
process). 

7. Loss functions and empirical risk minimization. Let $T$ be a measurable
space with $\sigma$-algebra $\mathcal{T}$, and let $(X,Y)$ be a random couple in $S\times T$
with joint distribution $P$. The distribution of $X$ will be denoted $\Pi$. Consider
a sample $(X_1,Y_1),\dots,(X_n,Y_n)$ of independent copies of $(X,Y)$ and let $P_n$
be the empirical distribution in $S\times T$ based on this sample, while $\Pi_n$ will
denote the empirical distribution in $S$ based on the sample $(X_1,\dots,X_n)$. Let
$\ell : T\times\mathbb{R}\mapsto\mathbb{R}_+$ be a loss function. Given a class $\mathcal{G}$ of measurable functions
from $S$ into $\mathbb{R}$, consider the following risk minimization problem:

$$E\ell(Y,g(X))\to\min,\qquad g\in\mathcal{G}.$$

If we denote $(\ell\bullet g)(x,y) := \ell(y;g(x))$, then we can rewrite this problem as
$P(\ell\bullet g)\to\min$, $g\in\mathcal{G}$, or

$$Pf\to\min,\qquad f\in\mathcal{F} := \ell\bullet\mathcal{G} := \{\ell\bullet g : g\in\mathcal{G}\},$$

so we are dealing with problem (1.1) for a class $\mathcal{F}$ of special structure (the
"loss class") and the results of previous sections can be specialized to this
case.

Let $\mu_x$ denote a version of the conditional distribution of $Y$ given $X = x$. Then
the following representation of the risk holds under some mild regularity
assumptions:

$$P(\ell\bullet g) = \int_S\int_T\ell(y,g(x))\,\mu_x(dy)\,\Pi(dx).$$

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Given a probability measure fj, on (T,T), let G avgmm u£ ^J T £(y]u)fj,(dy). 
If 

g*(x) :=n M;c = argmin / £(y;u)fi x (dy), 

then we have (assuming that the function g* is well defined and measurable) 
Vg, P(£ • g) > P{£ • <?* ) , so g* is a global minimal point of _P(£ • g). 
The corresponding empirical risk minimization problem is 

$$P_n(\ell\bullet g)=n^{-1}\sum_{j=1}^n\ell(Y_j;g(X_j))\longrightarrow\min,\qquad g\in\mathcal{G},$$

and $\hat g_n$ will denote its solution (we assume its existence for simplicity). The following assumption on the loss function $\ell$ is very useful in the analysis of this problem. Suppose there exists a function $D(u,\mu)\ge 0$ such that for any measure $\mu=\mu_x$, $x\in S$,

$$\int_T(\ell(y,u)-\ell(y,u_\mu))^2\,\mu(dy)\le D(u,\mu)\int_T(\ell(y,u)-\ell(y,u_\mu))\,\mu(dy). \tag{7.1}$$

In the case when the functions in the class $\mathcal{G}$ take their values in the interval $[-M/2,M/2]$ and $D(u,\mu_x)$, $|u|\le M/2$, $x\in S$, is uniformly bounded by a constant $D>0$, it immediately follows from (7.1) [by plugging in $u=g(x)$, $\mu=\mu_x$ and integrating with respect to $\Pi(dx)$] that for all $g\in\mathcal{G}$

$$P(\ell\bullet g-\ell\bullet g_*)^2\le D\,P(\ell\bullet g-\ell\bullet g_*). \tag{7.2}$$

As a result, if $g_*\in\mathcal{G}$, then the $L_2(P)$-diameter of the $\delta$-minimal set of $\mathcal{F}$ satisfies $D(\mathcal{F};\delta)\le 2(D\delta)^{1/2}$. Moreover, even if $g_*\notin\mathcal{G}$, the condition (6.3) still holds for the loss class $\mathcal{F}$ with $f_*=\ell\bullet g_*$, opening the way for Massart's penalization method in these types of problems. The idea to control the variance in terms of the expectation has been extensively used in [36] (and even in earlier work of Birgé and Massart) and in the learning theory literature [5, 6, 7, 8, 10, 37].

The analysis of risk minimization problems (in particular, proving the existence of $g_*$, checking condition (7.1), etc.) becomes much simpler under the convexity of the loss, that is, when for all $y\in T$, $\ell(y,\cdot)$ is a convex function. Problems of this type are called convex risk minimization. Both least squares regression and $L_1$-regression, as well as some of the methods of large margin classification (such as boosting), can be viewed as versions of convex risk minimization.

Assuming again that the functions in $\mathcal{G}$ take values in $[-M/2,M/2]$, we will introduce some even stricter assumptions on the loss function $\ell$. Namely, assume that $\ell$ satisfies the Lipschitz condition with some $L>0$:

$$\forall y\in T,\ \forall u,v\in[-M/2,M/2]:\quad|\ell(y,u)-\ell(y,v)|\le L|u-v| \tag{7.3}$$

and also that the following assumption on the convexity modulus of $\ell$ holds with some $\Lambda>0$:

$$\forall y\in T,\ \forall u,v\in[-M/2,M/2]:\quad\frac{\ell(y,u)+\ell(y,v)}{2}-\ell\Big(y;\frac{u+v}{2}\Big)\ge\Lambda|u-v|^2. \tag{7.4}$$

Note that if $g_*$ is bounded by $M/2$, conditions (7.3) and (7.4) imply (7.1) with $D(u,\mu)\le\frac{L^2}{2\Lambda}$. To see this, it is enough to use (7.4) with $v=u_\mu$, $\mu=\mu_x$ and integrate it with respect to $\mu$ to get for $L(u):=\int_T\ell(y,u)\,\mu(dy)$ (the minimum of $L$ is at $u_\mu$):

$$\frac{L(u)-L(u_\mu)}{2}=\frac{L(u)+L(u_\mu)}{2}-L(u_\mu)\ge\frac{L(u)+L(u_\mu)}{2}-L\Big(\frac{u+u_\mu}{2}\Big)\ge\Lambda|u-u_\mu|^2$$

and then to use the Lipschitz condition to get

$$\int_T|\ell(y,u)-\ell(y,u_\mu)|^2\,\mu(dy)\le L^2|u-u_\mu|^2\le\frac{L^2}{2\Lambda}\,(L(u)-L(u_\mu)).$$
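Conditions (7.3) and (7.4) can be checked numerically for a specific loss. The sketch below does this on a grid for the squared loss, assuming $y\in[0,1]$ and $M=2$; the constants $L=2+M$ and $\Lambda=1/4$ are worked out for this example and are not stated in the text, and all names are illustrative.

```python
import itertools


def sq_loss(y: float, u: float) -> float:
    return (y - u) ** 2


def convexity_gap(loss, y: float, u: float, v: float) -> float:
    """Left-hand side of (7.4): average of the losses minus loss at the midpoint."""
    return 0.5 * (loss(y, u) + loss(y, v)) - loss(y, 0.5 * (u + v))


M = 2.0                      # predictions take values in [-M/2, M/2]
LIPSCHITZ = 2.0 + M          # assumed constant L for y in [0, 1]
LAMBDA = 0.25                # assumed convexity modulus for the squared loss

grid_u = [-M / 2 + M * i / 20 for i in range(21)]
grid_y = [i / 10 for i in range(11)]
triples = list(itertools.product(grid_y, grid_u, grid_u))

# Condition (7.3): Lipschitz in the second argument.
lipschitz_ok = all(
    abs(sq_loss(y, u) - sq_loss(y, v)) <= LIPSCHITZ * abs(u - v) + 1e-12
    for y, u, v in triples
)
# Condition (7.4): convexity modulus bounded below by LAMBDA * |u - v|^2.
modulus_ok = all(
    convexity_gap(sq_loss, y, u, v) >= LAMBDA * (u - v) ** 2 - 1e-12
    for y, u, v in triples
)
```

For the squared loss the convexity gap equals $(u-v)^2/4$ identically, so (7.4) holds with equality in this example.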



This nice and simple trick, based on strict convexity, has been used repeatedly in the theory (see, e.g., [6]). We will use it again in the proof of Lemma 6. Sometimes a more general version of condition (7.4) is needed. It can be formulated as follows:

$$\forall y\in T,\ \forall u,v\in[-M/2,M/2]:\quad\frac{\ell(y,u)+\ell(y,v)}{2}-\ell\Big(y;\frac{u+v}{2}\Big)\ge\psi(|u-v|^r), \tag{7.5}$$

where $\psi$ is a convex nondecreasing function and $r\in(0,2]$. The following lemma will allow us to bound the local complexities of the loss class $\mathcal{F}=\ell\bullet\mathcal{G}$ in terms of local complexities of the class $\mathcal{G}$, which is often needed in applications. Let

W„(S;t) = W„ f (S) :=W n (Q;S;t) 



:= C L0 n m; M~r\S/2)) + J *~WIW + V + i 

V n n 

where $C>0$ is a numerical constant and $\omega_n$ is defined in Section 2.4.

Lemma 6. Suppose that $\mathcal{G}$ is a convex class of functions taking values in $[-M/2,M/2]$. Assume that the minimum of $P(\ell\bullet g)$ over $\mathcal{G}$ is attained at $\bar g\in\mathcal{G}$. Under the conditions (7.3) and (7.5), there is a choice of numerical constants $C$ and $K$ such that $\forall\delta,t$, $U_n(\mathcal{F};\delta;t)\le W_n(\mathcal{G};\delta;t)$ and $\delta_n(\mathcal{F};t)\le$

We are especially interested in the case when $\mathcal{G}:=M\,\mathrm{conv}(\mathcal{H})$, where $\mathcal{H}$ is a base class of functions from $S$ into $[-1/2,1/2]$ (see Example 5, Section 2.5). In this case, there are a number of powerful functional gradient descent-type algorithms (boosting algorithms) that allow one to implement convex empirical risk minimization over such classes. Assume that condition (2.1) holds for the class $\mathcal{H}$ with some $V>0$. Define



$$\pi_n(M,L,\Lambda;t):=C\bigg[\Lambda M^{V/(V+1)}\Big(\frac{L}{\Lambda}\vee 1\Big)^{(V+2)/(V+1)}n^{-\frac{1}{2}\frac{V+2}{V+1}}+\frac{L^2}{\Lambda}\,\frac{t+1}{n}\bigg]$$



with some numerical constant $C$. The next result is essentially a slightly generalized version of a theorem due to Bartlett, Jordan and McAuliffe [6]. We will derive it as a corollary of our Theorem 2, using several nice observations of Bartlett, Jordan and McAuliffe [6] (contained in the proof of Lemma 6).

Theorem 12. Under the conditions (7.3) and (7.4), $\delta_n(\mathcal{F};t)\le\pi_n(M,L,\Lambda;t)$ and as a result

$$\mathbb{P}\Big\{P(\ell\bullet\hat g_n)\ge\min_{g\in\mathcal{G}}P(\ell\bullet g)+\pi_n(M,L,\Lambda;t)\Big\}\le e^{-t}.$$

Because of the generality of the methods, the results can be easily extended to other examples of convex risk minimization problems. For instance, let $K$ be a symmetric nonnegatively definite kernel on $S\times S$ such that $|K(x,x)|\le 1$ for all $x\in S$. As in Example 7, Section 2.5, $\mathcal{H}_K$ is the reproducing kernel Hilbert space and $B_K$ is its unit ball. Let $\mathcal{G}:=\mathcal{G}_M:=\frac{M}{2}B_K$. This example is of importance in the theory of kernel machines. Clearly, $\mathcal{G}_M$ is a convex class of functions and, by elementary properties of reproducing kernel spaces, $\forall g\in\mathcal{G}_M$, $x\in S$: $|g(x)|\le M/2$. We will now use slightly rescaled Mendelson's complexities of Example 8. It is easy to check (using Mendelson's inequalities of Example 8, Lemma 6 and the argument used at the beginning of the proof of Theorem 12) that



M^«(^) + 



MA\ L 2 t + 1 



A n 



Tt n (M,L,A,t). 



With this new definition, the assertion of Theorem 12 still holds, and, more- 
over, based on the discussion in Example 7, one can replace in the bound 
the distribution-dependent Mendelson's complexity by its data-dependent 
version. 

An alternative to the approach of Lemma 6, exploited, for instance, in the paper of Blanchard, Lugosi and Vayatis [10], is based on a straightforward comparison of $L_2(P_n)$-distances and the corresponding entropies for the classes $\mathcal{G}$ and $\mathcal{F}=\ell\bullet\mathcal{G}$ (which is easy under the Lipschitz assumption on $\ell$) and then bounding localized complexities of $\mathcal{F}$ using inequality (2.4).

It is not hard to combine the bounds of this type with model selection results of Section 5 to obtain various oracle inequalities for model selection in convex risk minimization problems. In particular, in the case of model selection for a sequence of function classes $\mathcal{G}_k:=M_k\,\mathrm{conv}(\mathcal{H})$, where $\mathcal{H}$ is a VC-class, one would easily obtain a slight generalization of a recent result of Blanchard, Lugosi and Vayatis [10] on convergence rates of the regularized boosting algorithm.

8. Comments on regression and classification. General least squares regression is among the statistical problems for which penalization techniques have been very successful so far. In addition to the already mentioned papers by Birgé and Massart [8], Barron, Birgé and Massart [3] and Massart [36], we refer the reader to a book by van de Geer [46], a book by Györfi, Kohler, Krzyzak and Walk [22] and papers by Baraud [2] and Kohler [25]. Our goal here is only to outline the connection of regression problems to the more general theory considered in the previous sections.

To simplify the matter, we consider only the case of least squares regression with bounded noise, that is, $T=[0,1]$, $\ell(y,u):=(y-u)^2$. Thus, the regression problem is a convex risk minimization problem and it is well known and straightforward that in this case $g_*$ is the regression function: $g_*(x):=\mathbb{E}(Y|X=x)$. Given a class $\mathcal{G}$ of functions $g:S\mapsto[0,1]$, a solution $\hat g_n$ of the empirical risk minimization problem (over the class $\mathcal{G}$) is the well-known least squares estimate of the regression function. The first problem of interest is to provide upper bounds on $\|\hat g_n-g_*\|_{L_2(\Pi)}$.

To relate this to the general framework of convex risk minimization, note that in this case $u_\mu:=\operatorname{argmin}_u\int_0^1(y-u)^2\,\mu(dy)=\int_0^1y\,\mu(dy)$ and by very simple algebra

$$(\ell(y,u)-\ell(y,u_\mu))^2=((y-u)^2-(y-u_\mu)^2)^2=(u-u_\mu)^2(2y-u-u_\mu)^2\le 4(u-u_\mu)^2$$

and

$$\int_0^1(\ell(y,u)-\ell(y,u_\mu))\,\mu(dy)=\int_0^1[(y-u)^2-(y-u_\mu)^2]\,\mu(dy)=(u-u_\mu)^2. \tag{8.1}$$

As a result, condition (7.1) holds with $D(u,\mu)=4$. Note also that identity (8.1) implies (by integration) the formula $P(\ell\bullet g)-P(\ell\bullet g_*)=\|g-g_*\|^2_{L_2(\Pi)}$, which immediately reduces the study of $\|\hat g_n-g_*\|^2_{L_2(\Pi)}$ to excess risk bounds.
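Identity (8.1) and condition (7.1) with $D=4$ can be verified numerically for a discrete measure on $[0,1]$. The following Python sketch does so; the three-point measure and the function name are hypothetical choices for illustration.

```python
def variance_condition_terms(points, weights, u):
    """For squared loss and a discrete measure mu on [0, 1], return
    (second moment of the loss difference, excess risk, u_mu)."""
    u_mu = sum(w * y for y, w in zip(points, weights))          # mean of mu
    diffs = [(y - u) ** 2 - (y - u_mu) ** 2 for y in points]    # l(y,u) - l(y,u_mu)
    second_moment = sum(w * d * d for d, w in zip(diffs, weights))
    excess = sum(w * d for d, w in zip(diffs, weights))
    return second_moment, excess, u_mu


# A hypothetical three-point measure on [0, 1].
points = [0.0, 0.5, 1.0]
weights = [0.2, 0.3, 0.5]
us = [k / 10.0 for k in range(11)]
results = [variance_condition_terms(points, weights, u) for u in us]
```

For every $u$ on the grid, the excess term should equal $(u-u_\mu)^2$ and the second moment should not exceed four times the excess term.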

These observations allow one to simplify the arguments used in the previous section and to obtain the following result, using Theorem 1 and Lemma 5 (more precisely, see the bound right after this lemma). In the case when the class $\mathcal{G}$ is convex, there is a way to improve the bound of the lemma. The key observation is that under the convexity assumption, for all $g\in\mathcal{G}$,

$$\|g-\bar g\|^2_{L_2(\Pi)}\le\|g-g_*\|^2_{L_2(\Pi)}-\|\bar g-g_*\|^2_{L_2(\Pi)}$$

(see, e.g., [1], Lemma 20.9), which is a simplification and a specialization of the convexity inequalities used in the proof of Lemma 6.



Theorem 13. Let Q n {$) :=#n(<7;<5) ■ =0 rit g(5). There exists a constant 
K such that for all e € (0, 1] 



°{ \\9n ~ 9* \\l m >U+£) mf ll! 2 (n) + K (£) + 



t + l 



en 



< lo Eq T e 



If Q is convex, then 



B {||5„-<7*ll 2 >mf \\g-g*\\ 2 + K(el^ + 



Moreover, if 9 n can be upper bounded by a function 9 n which is of strictly 
concave type, then one can replace 9 n by 6 n and drop the logarithmic factor 
in the bound. 



The significance of the above inequalities is related to the fact that in many particular cases of the regression problem they allow one to recover asymptotically correct convergence rates. This follows from computations of local Rademacher complexities in particular examples, given in Section 2.5.
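To make the notion of a localized Rademacher complexity concrete, the sketch below estimates, by Monte Carlo over random signs, the quantity $\sup\{|R_n(f-g)|: P_n(f-g)^2\le\delta\}$ for a finite class evaluated on a fixed sample. Averaging over sign draws, the finite class of threshold indicators and all names are assumptions made for illustration; the precise definitions used in the paper are those of Section 2.4.

```python
import random


def localized_rademacher(values, delta, n_draws=200, seed=0):
    """values[k][j] = f_k(X_j). Monte Carlo average over Rademacher signs of
    sup |R_n(f - g)| over pairs with empirical squared distance <= delta."""
    rng = random.Random(seed)
    n = len(values[0])
    # Pairs (f, g) in the delta-neighbourhood; (f, f) always qualifies.
    pairs = [(a, b) for a in values for b in values
             if sum((x - y) ** 2 for x, y in zip(a, b)) / n <= delta]
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(abs(sum(e * (x - y) for e, x, y in zip(eps, a, b))) / n
                     for a, b in pairs)
    return total / n_draws


# Hypothetical class: four threshold indicators on an equispaced sample.
xs = [(j + 0.5) / 20 for j in range(20)]
thresholds = [0.2, 0.4, 0.6, 0.8]
values = [[1.0 if x >= th else 0.0 for x in xs] for th in thresholds]
small = localized_rademacher(values, 0.25)
large = localized_rademacher(values, 1.0)
```

Because the same seed yields the same sign draws, enlarging $\delta$ can only enlarge the set of pairs and hence the estimate, which mirrors the monotonicity of the localized complexity in $\delta$.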

In the model selection framework, it is assumed that there exists a sequence $\mathcal{G}_k$ of classes of functions (models) available for least squares regression estimation. Let $\hat g_{n,k}$ denote a least squares estimate in the class $\mathcal{G}_k$. Given data-dependent complexity penalties $\hat\pi(k)$ associated with classes $\mathcal{G}_k$, we define the penalized least squares estimator as follows:

$$\hat k:=\operatorname{argmin}_{k\ge 1}\bigg[n^{-1}\sum_{j=1}^n(Y_j-\hat g_{n,k}(X_j))^2+\hat\pi(k)\bigg],\qquad\hat g_n:=\hat g_{n,\hat k}.$$

It is very natural to use the penalization techniques of Theorems 6 and 11 to design complexity penalties and to establish oracle inequalities for the corresponding penalized least squares estimators.
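The penalized least squares rule can be sketched in a few lines of Python. In this hypothetical example the model $\mathcal{G}_k$ consists of piecewise constant functions on $k$ equal bins of $[0,1]$ and the penalty is taken proportional to $k/n$; both choices, the constant `K = 2.0` and the function names are assumptions for illustration and are not the penalties derived in the paper.

```python
def fit_histogram(xs, ys, k):
    """Least squares fit over piecewise constant functions on k equal bins of [0, 1]."""
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * k), k - 1)]


def penalized_selection(xs, ys, max_k, K=2.0):
    """Pick k minimizing empirical squared error plus the penalty K * k / n."""
    n = len(xs)
    best = None
    for k in range(1, max_k + 1):
        g = fit_histogram(xs, ys, k)
        emp_risk = sum((y - g(x)) ** 2 for x, y in zip(xs, ys)) / n
        score = emp_risk + K * k / n
        if best is None or score < best[0]:
            best = (score, k, g)
    return best[1], best[2]


# Noise-free step function observed on an equispaced design.
xs = [(i + 0.5) / 40 for i in range(40)]
ys = [0.0 if x < 0.5 else 1.0 for x in xs]
k_hat, g_hat = penalized_selection(xs, ys, max_k=6)
```

In this toy setting the two-bin model already fits the data exactly, so any larger model only pays a larger penalty.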



Example 1 (Dimension-based penalization). Suppose that for each $k$, $\mathcal{G}_k$ is a subset of a finite-dimensional subspace of $L_2(\Pi)$ of dimension $d_k$ and define $\hat\pi(k):=K\frac{d_k+t_k+1}{n}$, where $K$ is some numerical constant (see Example 1 of Section 2.5). The following oracle inequality holds with some constant $C>0$:
P{llfc - *lL m £ C inf { inf || 9 - 9 .||i i(n) + <4ge"V 

Example 2 (Kernel selection with Mendelson's complexities). In this 
example, one is given a sequence {Kj} of symmetric nonnegatively defi- 
nite kernels on S x S, Qj being the unit ball in the reproducing kernel 
Hilbert space Hk (see Example 7 of Section 2.5). For each j, one can de- 
fine empirical Mendelson's complexity and true Mendelson's complexity of 
the class Qj, as in Section 2.5. We use the notations 7n,j( - ) — IniQj',') and 
Tnj(-) = ln{Qj\ •) and define Tt(j) := K(jIj(1) + ^p), where K is a numer- 
ical constant. Then, the following oracle inequality holds: 

Pjllffn - <7*||! 2( n) > lb " ^lli 2( n) + (<*(!) + ^) }} 

oo 2 

<4£iog ? ^ e -<*. 

fc=i fe 

Example 3 (Penalization based on Rademacher complexities). One can 
also use localized Rademacher complexities, defined in Section 2.4 (see Lemma 1), 
as general penalties for model selection in regression problems. Namely, given 
a sequence of classes Qk, we set 

m .^(^) + ^±l) ^ , M (<^) + tti) 

with some (large enough) numerical constants K, K. Here u> n k(') = &n(Qkl ■) 
and u) n) fc(-) = u) n {Gk'-> ")■ Then we have (for a penalized least squares estimator 
g n ) with some constant C 

oo 2 

P{ll5n - 9*\\l 2 (iL) > C jg 1 { g i g k \\9 ~ ^llL(n) + *(*0}} < 4 E e " ife - 

We turn now to binary classification problems. In this case, $T:=\{-1,1\}$ and the loss function is chosen as $\ell(y,u):=I(y\ne u)$. The variable $Y$ is interpreted as an unobservable label associated with an observable instance $X$. Binary measurable functions $g:S\mapsto\{-1,1\}$ are called classifiers. The goal of classification is to find a classifier that minimizes the generalization error (the probability of misclassification)

$$\mathbb{P}\{Y\ne g(X)\}=P\{(x,y):y\ne g(x)\}=P(\ell\bullet g),$$

so the classification problem becomes a version of a risk minimization problem with a binary loss function. Its solution always exists and is given by the following classifier (Bayes classifier): $g_*(x):=g_{*,P}(x)=I(\eta(x)\ge 0)$, where $\eta(x):=\mathbb{E}(Y|X=x)$ is the regression function (see [15]). However, the distribution $P$ of $(X,Y)$ and the regression function $\eta$ are unknown and the Bayes classifier is to be estimated based on the training data $(X_1,Y_1),\dots,(X_n,Y_n)$ consisting of $n$ i.i.d. copies of $(X,Y)$. This is done by minimizing the so-called training error

$$n^{-1}\sum_{j=1}^nI(Y_j\ne g(X_j))=P_n\{(x,y):y\ne g(x)\}=P_n(\ell\bullet g)$$

over a suitable class $\mathcal{G}$ of binary classifiers, which is equivalent to empirical risk minimization over the loss class $\mathcal{F}=\ell\bullet\mathcal{G}$, and all the theory developed in the previous sections applies to classification problems.
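Training error minimization can be illustrated on a very small class. The Python sketch below minimizes the training error over threshold classifiers $g_\theta(x)=1$ if $x\ge\theta$ and $-1$ otherwise; the class, the toy sample and the function names are hypothetical.

```python
def make_threshold_classifier(theta):
    """Classifier g_theta(x) = 1 if x >= theta, else -1."""
    return lambda x: 1 if x >= theta else -1


def training_error(g, sample):
    """Fraction of sample points (x, y) with y != g(x)."""
    return sum(1 for x, y in sample if y != g(x)) / len(sample)


def erm_threshold(sample):
    """Minimize the training error over thresholds placed at the sample points
    (plus one threshold to the right of all points)."""
    xs = sorted({x for x, _ in sample})
    thetas = xs + [xs[-1] + 1.0]
    errors = [training_error(make_threshold_classifier(t), sample) for t in thetas]
    k = min(range(len(thetas)), key=lambda i: errors[i])
    return thetas[k], errors[k]


sample = [(0.1, -1), (0.2, -1), (0.4, 1), (0.6, -1), (0.7, 1), (0.9, 1)]
theta_hat, error_hat = erm_threshold(sample)
```

Restricting thresholds to sample points loses nothing here, since the training error only changes when $\theta$ crosses an observed $x$.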

It is straightforward to check that condition (7.1) holds for the binary loss $\ell$ with $D(u,\mu_x)=\frac{1}{|\eta(x)|}$ (moreover, the inequality in this case becomes an equality). If for some $C>0$, $\alpha>0$

$$\forall t>0:\quad\Pi\{x:0<|\eta(x)|\le t\}\le Ct^\alpha,$$

then it easily follows that

$$P(\ell\bullet g)-P(\ell\bullet g_*)\ge c_0\,\rho_P^{2\kappa}(\ell\bullet g,\ell\bullet g_*), \tag{8.2}$$

where $\rho_P(\ell\bullet g_1,\ell\bullet g_2):=\Pi^{1/2}\{x:g_1(x)\ne g_2(x)\}=\frac{1}{2}\big(\Pi(g_1-g_2)^2\big)^{1/2}$, and $\kappa=\frac{1+\alpha}{\alpha}$ (see [44]). To get $\kappa=1$, one can assume that for some $t_0>0$, $\Pi\{x:0<|\eta(x)|\le t_0\}=0$. Roughly, assumptions of this type describe the degree of separation of the two classes in the classification problem, or the level of the "noise" in the labels ("low noise assumption"). Now one can use Theorem 8 of Section 6.1 to get the convergence rates in classification obtained first by Mammen and Tsybakov [35] and Tsybakov [44]. Namely, if $\mathcal{P}$ denotes a class of probability distributions on $S\times\{-1,1\}$ and $\mathcal{G}$ is a class of binary classifiers such that, for all $P\in\mathcal{P}$, $g_{*,P}\in\mathcal{G}$, condition (8.2) holds (with the same $\kappa$ and $c_0$) and the $L_2(\Pi)$ bracketing entropy of the class $\mathcal{G}$ is of the order $O(\varepsilon^{-2\rho})$ as $\varepsilon\to 0$ uniformly in $P\in\mathcal{P}$ for some $\rho\in(0,1)$, then for a classifier $\hat g_n$ that minimizes the training error over $\mathcal{G}$ we have

$$\sup_{P\in\mathcal{P}}\big[P\{(x,y):y\ne\hat g_n(x)\}-P\{(x,y):y\ne g_{*,P}(x)\}\big]=O\big(n^{-\kappa/(2\kappa+\rho-1)}\big).$$

This was the result originally proved by Mammen and Tsybakov [35]. They also showed the convergence rate to be optimal in a minimax sense [35, 44]. As a consequence of Theorem 9, it is also easy to get an improvement of the model selection result of Tsybakov [44] (see Theorem 3 there) in the sense that our version of adaptation gives the precise convergence rates (Tsybakov's bounds involve an extra logarithmic factor).

Unfortunately, minimization of the training error over huge classes of binary functions (with entropy growing as $\varepsilon^{-\rho}$) is most often a computationally intractable problem. In so-called large margin classification algorithms (such as boosting and many algorithms for kernel machines) this difficulty is avoided by replacing the binary loss by a smooth (often, convex) loss function that dominates the binary loss, and using a version of functional gradient descent to minimize the corresponding empirical risk. In this setting, it is common to use real-valued functions $g$ as classifiers. At the end, $\operatorname{sign}(g(x))$ is computed to predict the label of an instance $x$. Let $\phi$ be a nonnegative convex function such that $\phi(u)\ge I(u\le 0)$. We set $\ell(y,u):=\phi(yu)$ and look at a convex risk minimization problem $P(\ell\bullet g)\to\min$ and its empirical version $P_n(\ell\bullet g)\to\min$. Recently, Bartlett, Jordan and McAuliffe [6] and Blanchard, Lugosi and Vayatis [10] obtained reasonably good convergence rates for these types of algorithms. Their analysis is, essentially, a special version of the somewhat more general analysis of convex risk minimization problems given in the previous sections.
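The domination requirement $\phi(u)\ge I(u\le 0)$ is easy to check for common surrogates. The Python sketch below does this on a grid for the hinge, exponential and base-2 logistic functions and evaluates an empirical $\phi$-risk; these particular surrogates are standard examples chosen here for illustration and are not singled out in the text.

```python
import math


def zero_one(u: float) -> float:
    """Binary loss as a function of the margin u = y * g(x)."""
    return 1.0 if u <= 0 else 0.0


# Convex surrogates phi; each should dominate the binary loss.
SURROGATES = {
    "hinge": lambda u: max(0.0, 1.0 - u),
    "exponential": lambda u: math.exp(-u),
    "logistic": lambda u: math.log2(1.0 + math.exp(-u)),
}


def surrogate_risk(phi, g, sample):
    """Empirical phi-risk n^{-1} * sum_j phi(Y_j * g(X_j))."""
    return sum(phi(y * g(x)) for x, y in sample) / len(sample)


grid = [k / 10.0 for k in range(-30, 31)]
dominates = {
    name: all(phi(u) >= zero_one(u) - 1e-12 for u in grid)
    for name, phi in SURROGATES.items()
}
```

Since each surrogate dominates the binary loss pointwise, its empirical risk is an upper bound on the training error of the classifier $\operatorname{sign}(g(x))$.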

9. Main Proofs. 

Proof of Proposition 1. For the first part, note that 

To prove the second part, note that by induction $\delta_k$ is nonincreasing and takes values in $[\bar\delta,1]$. Denote $d_k:=\delta_k-\bar\delta$. We have



and since $\psi$ is of strictly concave type with exponent $\gamma$ and $\delta_k\ge\bar\delta$, we get

d k+1 < ^B(5i - p) < ^2^(4 - = s^di 

The result now follows by induction. □ 
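The iterative scheme behind the second part can be illustrated numerically. The sketch below assumes the recursion $\delta_0=1$, $\delta_{k+1}=\psi(\delta_k)\wedge 1$ and uses the hypothetical function $\psi(\delta)=c\,\delta^{\gamma}$ with $c=0.1$, $\gamma=1/2$ as an example of a function of strictly concave type; neither the exact recursion nor this $\psi$ is taken verbatim from the statement of Proposition 1, which is not reproduced here.

```python
def iterate_to_fixed_point(psi, start=1.0, steps=30):
    """Return the sequence delta_0 = start, delta_{k+1} = min(psi(delta_k), 1)."""
    deltas = [start]
    for _ in range(steps):
        deltas.append(min(psi(deltas[-1]), 1.0))
    return deltas


C, GAMMA = 0.1, 0.5                       # illustrative constants


def psi(d: float) -> float:
    return C * d ** GAMMA


fixed_point = C ** (1.0 / (1.0 - GAMMA))  # solves psi(d) = d for this psi
deltas = iterate_to_fixed_point(psi)
gaps = [d - fixed_point for d in deltas]  # d_k = delta_k - fixed point
```

For this particular $\psi$ the sequence decreases monotonically to the fixed point, and each gap $d_{k+1}$ is at most $\gamma d_k$, which follows from concavity of $\psi$ and the slope $\gamma$ of its tangent at the fixed point.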

Proof of Lemma 1. The first bound trivially follows from the symmetrization inequality $\phi_n(\delta)\le 2\omega_n(\delta)$ and the definition of the $\sharp$-transform. Let $\delta_j:=q^{-j}$. In what follows $\delta=\delta_i$ for some $i$. To prove the second bound, define



E{5):=\u n {8)< sup \R n (f - g)\ + \ 2-{5 + 2uj n (5)) + |^
n sup \(P n -P)((f-g) 2 )\<E sup \{P n -P)((f-gf)
[P(f-g) 2 <6 P(f-g) 2 <s 



+ 2- (S + 2E sup \(P n - P)((f - 0)2)|) +M. 

It follows from Talagrand's concentration inequalities that $\mathbb{P}(E(\delta))\ge 1-2e^{-t}$. By symmetrization and contraction inequalities,

$$\mathbb{E}\sup_{P(f-g)^2\le\delta}|(P_n-P)((f-g)^2)|\le 2\,\mathbb{E}\sup_{P(f-g)^2\le\delta}|R_n((f-g)^2)|\le 8\,\omega_n(\delta).$$

Therefore, on the event $E(\delta)$,



P{f-9?<5 =► P n (f -g) 2 < 5 + 8Q n (5) + 2^5 + 2^8^(5) + ^-, 

and using the inequality 2ab < a 2 + b 2 the right-hand side can be further 
bounded by 25 + l6u n (S) + 2 \. Assuming that 5 > q~ 1 Co^ q (e) > ^, and using 
the monotonicity of Q n , we get 

u n {o) < o sup — — — < 5 sup — -r^— 

8 3 > q -^li q (e) d i «i>?- 1 «aJl ,9 (B) °i 

< qo sup — — < qeo. 

Therefore, for e G (0, 1] and <5 > q~ 1 u!^ q (e) > t/n, on the event E(5) 

2t 

P(f-g) 2 <S =► P„(/- 5 ) 2 <25 + 16cu n (<5) + -<(4 + 16(?)5. 

n 

Also, on the same event and under the same conditions, 



t 8t 
u n (5)< sup \R n (f-g)\ + J2-(5 + 2u; n (6)) + — 

P(f~g) 2 <t V n 3n 



sup ^ U - g) \ + J 2 U + 2J*M* + %. 

P n (f-g) 2 <{±+m)5 V n V 2 re 3n 



< sup | jRn (/_£)|W2-5+ — + _ + ^-^, 

^n(/-9) 2 <(4+i6 9 )5 \ n 6n n I 

where we again used the inequality 2ab < a 2 + b 2 . Therefore, on the event 
E{S) 

uj n {5)<2 sup \R n {f-g)\+2V2~J-5 + — 

^n(/-3) 2 <(4+16g)<5 V n n 



2^ n ((4 + 16 g )(J) + 2V2JU + — =: ^{5) 

V n n 




as soon as S > q~ 1 ui'^ 9 (e) > ^. 

Note that if q~ 1 ujli q {e) < ^, then the second bound of the lemma is triv- 
ially satisfied. Otherwise, denote 

E:= H E (*J)- 

j--S J >q- 1 ^"(e)>± 

Clearly, F(E) > 1 - 21og„ x e_< ' an< ^' on tne event E, we have u) n {Sj) < ip(Sj) 
for all Sj > q~ 1 iD^ q (e), which implies that (see Property 2' in Section 2.3) 
&h 9 ( £ ) — $ (s)- Using the properties of {(-transform, this yields by a simple 
computation that 



4( e )<c(4(ce) + ^) 



with some constants C, c depending only on g. 

To prove the third bound, we introduce the following event: F :=f] s > ± F(5j), 

3 — n 

where 



F{5):={ sup \R n (f-g)\<u n (c q 5) + d2-(c q 5 + 20 n (c q 5)) + ^- 
\P(f-a) 2 <c q s v n in . 



n sup |(P re -P)((/- g f)\ <E sup |(P n -P)((/-«^ 

l^(/-9) 2 <5 P(/~9) 2 <5 



+ /2-(<y + 2E sup |(P n - P)((/ - + ± 

with a constant c q depending only on q to be chosen later on. It follows 
from Talagrand's concentration inequalities that P(P) > 1 — 21og„^e _ *. 
Let 5 = Si for some i and Si > . On the event F the following implication 
holds: 

Pn{f-g?<5 and P(/-<7) 2 e(^ + i,<y 

=> ^ = <5 i+ i<P(/-<?) 2 <<5 + sup |(P n -P)((/ 

< d + 16u; n (d,-) + H , 

q z n 

where we used the same computation as in the previous part of the proof 
with minor modifications. If Sj > ui% 9 (e), then O n (5j) < eSj, and we can get 

n
If £ < ^{q" 1 — q~ 2 ) (note that it is enough to prove the bound under this
restriction and the general case would follow by changing the constants), 
then we get that 

(4/3 + q 2 /2)f 



S j <2(q- 1 -q- 2 r 1 (8 + 



n 



What we proved so far can be formulated as follows: on the event F, for 
P n (f-g) 2 <5 

^ P if - 9 ? < 2(g - - (* + m±jW) v 

which means that for 5 > Ld^ q (e), P n (f - g) 2 < S =>• P(/ - #) 2 < c q 5 with a 
constant c q > 1 depending only on g. This allows us to conclude that on the 
event F for all 5 = 5i> u$«(e) V £ 



w„(5) < sup |-Rn(/ - 9)1 < w„(c g (5) + \/2-(cqS + 2uJ n (Cg5)) + — 
P(f-a) 2 <c q 8 v n 3n 



/ t 2t 
< 2uj n {c q 5) + J 2c q S- + — =: ip(8). 
V n n 

Next we use the basic properties of the jj-transform to conclude the proof. 
Since ip{5) > uj n {5) V we get for all e G (0, 1], ip^ q (e) > w| 9 (e) V £. Thus, 
for all 8 > i/fl*(e), w n (S) <^{S), implying that &| 9 ( e ) < ^*'«(e). Now it is 
easy to conclude that on the event F 

4( £ )<c(4( ce )4--L 

with some constants C, c depending only on q. 
The proof for lo\ r is similar. □ 

Proof of Theorem 1. Let 

E nd (t):=\ sup |(P n -P)(/- 5 )|<[/ n (<^)}. 

By Talagrand's concentration inequality, F((E n j(t)) c ) < e - '. Let Sj > 5. 
Since on the event E n j(t), 

f n G ^(^j+i,^] 

Ve G (0,^ + i ) V 5 G^(e) 

S j+ l<£(fn)<Pfn-P9 + £
<PnL- Pn9 + {P~ Pn){fn ~ g) + £
<4(/n)+ SUp \(P n -P)(f-g)\ +£ 

<U n {S j ;t)+s<V n (S;t)S j +e 

V n (6;t)>->±- 
q 2q 

=> S<uk«(^)=8 n (t), 

we can conclude that, for Sj > 5 > 6 n (t), {f n £ J-*(<5j-+i, Sj]} C (E n j(t)) c . 
Therefore, for 5 > S n (t), on the event E n (t) := Plj <s^ >5 En,j(t) we have £(f n ) < 
5, implying that 

P{^(/n)>5}< ^ P((i? nj ( i )) c )<log g |e-'. 

j :*,■>* 

Now, on the event E n (t), we have f n € and for all j such that <5j > 5 

f€?(6 j+1 ,6 j ] 

=► Vee (O.ffj-) V#e JF( £ ) 

S(f)<Pf-Pg + £<P n f-Pn9+(P-Pn)(f-9)+£ 

< £ n (f) + t) + e < 4(/) + t)^ + e 
<£ n (f) + qV n (S;t)£(f) + e, 

which means that on this event £(f) >S^- £ n (f) > (1 — qV n (5; t))£(f). Sim- 
ilarly, we have on E n {t) 

feF(6 j+1 ,6j] 

£ n (f) = Pnf ~ Pnfn <Pf~ P fn + (^n ~ P)(f ~ fn) 

< £(f) + U n {S f ,t) < £{f) + V n {6; t)5j 

< £(f) + qV n (S; t)£{f) = (1 + qV n (5; t))£(f), 

so that £ (/) > 6 => £ n (f) < (l + qV n (5; t))£{f). Since P((£„(i)) c ) < log, \e~\ 
the result follows. □ 

Proof of Lemma 2. Consider the following event:

E := {V/ G ^ with £{f) > 5 n (t) : \ < ^ < |}.
It follows from Theorem 1 and the definition of 5 n (t) that ¥(E) > 1 —
logg j-^ye~*. Consider also 

F:={ sup \(P n -P)(f-g)\<U n (5 n (t);t)}. 
It follows from the concentration inequality that F(F) > 1 — e~ t . Therefore, 

On the event E, we have 

(9.1) V/G^: £(f)<2£ n (f)V5 n (t), 

which implies that for all 5 > 5 n (t), J~ n ($) C J- (25). On the other hand, on 
the same event E, V/ £ F:S(f) > 5 n (t) S n (f) < §£(/). 
On the event .F, 

£(/)<*«(*) =► 4(/) <£(/)+ sup |(P n -P)(/- 5 )| 

<£(f) + U n (5 n (t);t) 
< 5 n (t) + gF,, (*„(*) ;*)*„(*) < |<5„(t). 

Thus, on the event E n F 

(9.2) V/£^: 4(/)<|(£(/)V $„(*)), 

which implies that V<5 > <5 n (t) : .F(<5) C fi n (35/2). □ 

Proof of Theorem 2. It is similar to the proof of Theorem 1, but now our goal is to avoid using the concentration inequality many times (for each $\delta_j$) since this leads to a logarithmic factor. The trick was previously used in [36] and in the Ph.D. dissertation of Bousquet (see also [5]). Define

$$\mathcal{G}_\delta:=\bigcup_{j:\delta_j\ge\delta}\frac{\delta}{\delta_j}\{f-g:f,g\in\mathcal{F}(\delta_j)\}.$$

Then the functions in $\mathcal{G}_\delta$ are bounded by 1 and

$$\sigma_P(\mathcal{G}_\delta)\le\sup_{j:\delta_j\ge\delta}\frac{\delta}{\delta_j}\sup_{f,g\in\mathcal{F}(\delta_j)}\sigma_P(f-g)\le\delta\sup_{j:\delta_j\ge\delta}\frac{D(\delta_j)}{\delta_j}\le D(\delta),$$

since $D$ is of concave type. Also, since $\phi_n$ is of strictly concave type, Proposition 1 gives

E\\P n -P\\g s =E sup j- sup \{P n -P)(f-g)\ 
< T E SU P \(Pn-P)(f-g)\ 

j:Sj>6 °3 

Now Talagrand's concentration inequality implies that there exists an event $E$ of probability $\mathbb{P}(E)\ge 1-e^{-t}$ such that on this event $\|P_n-P\|_{\mathcal{G}_\delta}\le U_n(\delta;t)$ (the constant $K$ in the definition of $U_n(\delta;t)$ should be chosen properly). Then on the event $E$

$$\forall j\ \text{with}\ \delta_j\ge\delta:\quad\sup_{f,g\in\mathcal{F}(\delta_j)}|(P_n-P)(f-g)|\le\frac{\delta_j}{\delta}\,U_n(\delta;t)\le V_n(\delta;t)\,\delta_j.$$

The rest repeats the proof of Theorem 1. □

Remark. There is also a way to prove a bound on $\mathcal{E}_P(\hat f_n)$ based on the iterative localization method described in the Introduction and in the second statement of Proposition 1. Namely, one can assume that both $\phi_n$ and $D$ are of strictly concave type with exponent $\gamma\in(0,1)$. As a result, the function $U_{n,t}$ is also of strictly concave type with the same exponent. If now $\bar\delta_n(t)$ denotes its fixed point, then by Proposition 1(ii), the number $N$ of iterations needed to achieve the bound $\delta_N\le 2\bar\delta_n(t)$ is smaller than $\log\log_2((1-\bar\delta_n(t))/\bar\delta_n(t))/\log(1/\gamma)+1$ in the case when $\bar\delta_n(t)<1/2$ and $N=1$ otherwise. Thus, the argument described in the Introduction immediately shows that $\mathbb{P}\{\mathcal{E}_P(\hat f_n)\ge 2\bar\delta_n(t)\}\le Ne^{-t}$. This approach was first used in [27] (and later also in some of the arguments of [5]).

Proof of Theorem 3. The proof consists of several steps. Through- 
out, H will denote the event introduced in Lemma 2. According to this 
lemma, we have ¥(H) > 1 — log^ e . 

Step 1. Bounding the Rademacher complexity. Using Talagrand's concen- 
tration inequality, we get (for 6 > and t > 0) on an event F = F(5) with 
probability at least 1 — e~ l 



E sup \Rn(f-g)\< sup \Rn(f-g)\ 

+ /-(Z?2(J) + 2E sup \R n (f-g)\)+^,
which implies that

E sup \R n (f-g)\< sup \R n (f- g )\+D(5)J^+^ 



1 2t 
+ 2/-E sup \R n (f-g)\- 

I 04 Of 

< sup \R n (f-g)\ + D(S) x /-+ ' 



re 3n 



1 2t 
+ -E sup \R n (f-g)\ + -, 

2 f,g&HS) 71 



or 

E 



/t 28i 
- + !-• 
re ore 



/,ff&F(<5) /,9G^(<5) 

This can be further bounded using Lemma 2. Namely, for all 5 > 5 n (t), we 
have on the event H D F that 



E sup \R n (f-g)\<2 sup |J4(/- 5 )|+2v / 21}(S)J^ + 



t 28t 
3n 



Step 2. Bounding the diameter D(5). Again, we apply Talagrand's con- 
centration inequality to get on an event G = G{5) with probability at least 
l-e -t 

D 2 (5)= sup P(f-g) 2 

< SUp P n {f-gf+ sup \( Pn -P)((f-gf)\ 

f,g^(S) f,ger(5) 

< SUp P n (f-gf+E Slip \(P n - P)((f - gf)\ 



+ J-(d 2 (6) + 2E sup \(P n -P)((f-g)2)\) + 

where we also used that $\sup_{f,g\in\mathcal{F}(\delta)}\mathrm{Var}_P((f-g)^2)\le\sup_{f,g\in\mathcal{F}(\delta)}P(f-g)^2=D^2(\delta)$, since the functions from $\mathcal{F}$ take their values in $[0,1]$. Using the symmetrization inequality and then the contraction inequality for Rademacher processes, we get

$$\mathbb{E}\sup_{f,g\in\mathcal{F}(\delta)}|(P_n-P)((f-g)^2)|\le 2\,\mathbb{E}\sup_{f,g\in\mathcal{F}(\delta)}|R_n((f-g)^2)|\le 8\,\mathbb{E}\sup_{f,g\in\mathcal{F}(\delta)}|R_n(f-g)|.$$

It follows from Lemma 2 that for all $\delta\ge\delta_n(t)$ on the event $H$ we have



sup P n (f-gy< sup p n {f- g y = Di[-8 
Hence, on the event H flG 

^<^ft)+SE sup \Rn(f-g)\ +D{8)M 
V 2 / /,s&F(«S) v " 



+ 2/-E sup + 



(3 \ / 2f 9i 

-5 +9E sup | J R n (/- 5 )| +jD ( ( jW- + - ) 
f,g£F(5) V n n 



where we applied the inequality 2V ab < a + b, a,b>0. Next we use the 
resulting bound of Step 1 to get on H n F n G 

^ 2 (^) < §*) + 18 sup - 5 )| + 19D(6)J* + — . 

V y f,gefi„(3S/2) v n n 



As before, we bound the term 19D(<5)y ^ = 2 x 19-^ y | using the in- 
equality 2a6 < a 2 + 6 2 and this yields 

D 2 (5) < I^ 2 (5) + bl (\s) + 18 sup |iU/ - 5)1 + — ■ 

2 V/ 7 f,geA(3S/2) n 

As a result, we get the following bound holding on the event H f] F f] G: 

lOOOt 



D\5)<2Di[-5) +36 sup |i?„(/-<?)|- 
which also implies 



f,ger n (3S/2) n 



D(5)<V2D n (h)+Q I \R n (f-g)\ + —. 

Kl J y/, 5 eA(35/2) n 

Step 3. Bounding U n in terms of U n . We use the bound on D{5) in terms 
of D n (^5) (Step 2) to derive from the bound of Step 1 that 

E sup \Rn(f-g)\<2 sup - g)\ +4D n ( h) 

f,9^(S) f,gef n (3S/2) \2 / \ n
= 3 sup 1

f,g€A(3S/2) 



n n 



which holds on the event H n F n G. By the symmetrization inequality, we 
also have 



t 344t 

E sup |(P re -P)(/- 5 )|<6 sup | J R n (/- 5 )|+8D n (^)^/- + - 

f,9£HS) f,g€? n (36/2) 



n n 



which holds on the same event. Recalling the definition of U n and U n , the last 
bound together with the bound of Step 2 shows that with a straightforward 
choice of numerical constants K, c the following bound is true on the event 
HnFnG: U n (S;t)<U n (S;t). 

Step 4. Bounding U n in terms of U n . The derivation is similar to the 
previous one. First, by Lemma 2 and Talagrand's concentration inequality, 
for all 6>6 n (t), 

sup \Rn{f-g)\< sup \R n (f-g)\<M sup \R n {f - g)\ 



on the event H n F', where F' = F'(5) is such that P(F') > 1 - e _i . Next, 
using the desymmetrization inequality, 

e sup \Rn(f-g)\ 

/,96^(25) 

<E sup \R n (f ~ g ~ P(f ~ g))\ + sup |P(/-<?)|E|P n (l)| 

/,fleJF(2<5) f,germ 

<2E sup |(p n -P)(/- p )|+„-V2 sup pi/2 (/ _ 5) 2 
f,geT(26) /,se^(25) 

<2 < /. n (25)+n- 1/2 J D(25). 

Therefore, we get (by getting rid of <^> n under the square root) 

sup |i? n (/- 5 )|<40 n (2<5)+ J D(25)( -L + V2*/l ) +-. 

We turn now to bounding the empirical diameter D n {5). Again, by Lemma 2 
and Talagrand's concentration inequality, we have for all 5 > 5 n (t) on the 
event H n G' , where G' = G'{5) is such that P(G') > 1 - e~\ 

D 2 n {5):= sup P n (f-gf< sup Pn(f-g) 2 

f,ge?n(S) /,96^(2«5)
< sup P(f-g) 2 + sup \(P n -P)((f- g f)
f,geF(25) f,ge^{25) 

<D 2 (26)+E sup \(P n -P)((f-g) 2 )\ 



+ 2- (d2(25) + 2E sup \(P n -P)((f-g)2)\) + ±. 
As in Step 2, we use symmetrization and contraction inequalities to get 



E sup \(P n -P)((f-g) 2 )\ <8E sup \R n (f - g)\, 



and then using the desymmetrization bound, as in Step 3, to get 
E sup \(P n -P)((f -gf)\ <l^ n {25)+? 1 



By a simple computation this implies that 

b 2 n {5) < D\25) + 320 n (25) + £>(2«J) (M + + -. 

\ V n \ n I n 



The same algebra we already used in Step 3 yields the inequality U n (S;t) < 
U n (5; t) that holds on the event H n F' C\G' with properly chosen numerical 
constants K,c in the definition of U n . 

Step 5. Conclusion. Using the inequalities of Steps 4 and 5 for 5 = 5j > 
5 n (t) gives 

P ^^-( l0g ^ + 4l0g ^) eXp{ - t} ' 

where 

E := {V5j > SnM-.Unfat) < Un(6y,t) < U n {S j; t)}, 

since 

Ed (J {HnF(6 s )nG(6 j )nF'(6 j )nG?(8 j )). 

j : 5j>8~ n (t) 

Applying to ip(6) := U n ,t{8) property 7' of the (J, ^-transform, we get with 
c = q 2 

Therefore, using property 2' of the ft, (/-transform, we get on the event E
and then, repeating the same argument for 6 n (t), that

implying the result. □ 

Proof of Theorem 4. Denote 

fa(a,S):=K sup sup \(P n - P)(f - g)\. 

g&F{&) feF(5),p P (f,g)<r(a,S)+e 

Clearly, tp^(a,5) I ip n (a,5) as e J, 0. Define 



XJlia- 5; t) := ^(a, <S) + \/2-((r(a, 5) + ef + 2^(<r, 5)) + 

V n 3n 

We also have U^(a;S;t) | U n (a;5;t) as e J, 0. Let 

E nd (t;e):={ sup sup |(P n - P)(/ - 5 )| < fyt)}. 

L <?e^( CT ) /ef(i J ),Pp(/ ; 9)<r>,^)+ E J 

By Talagrand's concentration inequality, ¥((E n j(t;e)) c ) < e~ l . Hence, for 

E n {t-e):= f) E nJ (t;e), 

j:8j>5 

we have ¥((E n (t;e)) c ) < log q |e _t . On the event E n (t;e), for all j such that 
Sj > 6, 

f G J-(5j+i,5j] 3geF(a): p P {f,g)<r(a,5 j ) + £ 

=> £(f)<Pf-Pg + a 

<Pnf-Pn9+(P-Pn)(f-9)+<T 

<£ n {f) + U e n {a,8 f ,t) + a. 

Therefore, 

¥{3j : 3f G Ftfj+uSj] : Sj > 8, £(f) > £ n (f) + W n (a, S f ,t) + a} < log, | e -*. 
Let 

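\[ F := \{\exists f\in\mathcal F:\ \mathcal E(f)\ge\delta \text{ and } \hat{\mathcal E}_n(f)<(1-qV_n(\sigma,\delta;t))\,\mathcal E(f)\}. \]

Then,

\[ \begin{aligned}
F &\subset \{\exists j\ \exists f\in\mathcal F(\delta_{j+1},\delta_j]:\ \delta_j\ge\delta,\ \mathcal E(f)>\hat{\mathcal E}_n(f)+V_n(\sigma,\delta;t)\,\delta_j\} \\
&\subset \{\exists j\ \exists f\in\mathcal F(\delta_{j+1},\delta_j]:\ \delta_j\ge\delta,\ \mathcal E(f)>\hat{\mathcal E}_n(f)+U_n(\sigma,\delta_j,t)+\sigma\}.
\end{aligned} \]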
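Because of the monotonicity of $U_n^{\varepsilon}$ with respect to $\varepsilon$,

\[ \begin{aligned}
&\mathbb P\{\exists j\ \exists f\in\mathcal F(\delta_{j+1},\delta_j]:\ \delta_j\ge\delta,\ \mathcal E(f)>\hat{\mathcal E}_n(f)+U_n(\sigma,\delta_j,t)+\sigma\} \\
&\quad=\lim_{\varepsilon\to0}\mathbb P\{\exists j\ \exists f\in\mathcal F(\delta_{j+1},\delta_j]:\ \delta_j\ge\delta,\ \mathcal E(f)>\hat{\mathcal E}_n(f)+U_n^{\varepsilon}(\sigma,\delta_j,t)+\sigma\} \\
&\quad\le\limsup_{\varepsilon\to0}\mathbb P((E_n(t;\varepsilon))^c)\le\log_q\frac{q}{\delta}\,e^{-t},
\end{aligned} \]

implying $\mathbb P(F)\le\log_q\frac{q}{\delta}\,e^{-t}$. This proves the second bound of the theorem, and it also implies the first bound since on the event $F^c$, $\mathcal E(\hat f_n)\le\delta$; otherwise, we would have

\[ 0=\hat{\mathcal E}_n(\hat f_n)\ge(1-qV_n(\sigma,\delta;t))\,\mathcal E(\hat f_n)\ge\delta/2, \]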
a contradiction. □ 
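
The deviation bound used in this proof has the familiar shape "expectation, plus a variance term, plus a $t/(3n)$ term". The following minimal Python sketch evaluates a bound of that shape and shows that it decreases as $\varepsilon\downarrow0$. The function names and the numerical values are illustrative assumptions; in particular the expectation term is held fixed here, whereas in the proof it also shrinks with $\varepsilon$.

```python
import math

def bousquet_bound(phi, radius, t, n):
    """phi + sqrt(2*(t/n)*(radius^2 + 2*phi)) + t/(3n)."""
    return phi + math.sqrt(2.0 * (t / n) * (radius ** 2 + 2.0 * phi)) + t / (3.0 * n)

def u_eps(phi_eps, r, eps, t, n):
    # Analogue of U_n^eps: the radius r is enlarged by eps.
    return bousquet_bound(phi_eps, r + eps, t, n)

# Hypothetical inputs: phi = 0.05, r = 0.2, t = 3, n = 1000.
values = [u_eps(0.05, 0.2, eps, t=3.0, n=1000) for eps in (0.1, 0.01, 0.001, 0.0)]
```

The list `values` is strictly decreasing, and its last entry is the bound with $\varepsilon=0$.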

Proof of Proposition 2. We have $Pf=1/2$ for all $f\in\mathcal F$ and, as a result, $\mathcal F(\delta)=\mathcal F$ for all $\delta\ge0$. This implies that $r(\sigma;\delta)=0$ for all $0<\sigma\le\delta$, and also $\phi_n(\sigma;\delta)=0$. Therefore, $\bar\delta_n(\sigma;t)$ is of the order $Ct/n$. Note also that, for all $k\ne j$, $P(f_k-f_j)^2=1/2$, so $D_P(\mathcal F;\delta)=1/2$. On the other hand,

\[ \phi_n(\delta)=\mathbb E\sup_{f,g\in\mathcal F}|(P_n-P)(f-g)|=\mathbb E\max_{k,j}|(P_n-P)(f_k-f_j)|, \]

which can be shown to be of the order $c(\log N/n)^{1/2}$. This easily yields the value of $\bar\delta_n(t)$ of the order $c((\log N/n)^{1/2}+(t/n)^{1/2})$. The excess risk of $\hat f_n$ (and, as a matter of fact, of any $f\in\mathcal F$) is 0, so the bound $\bar\delta_n(t)$ is not sharp at all. Next we show that (iv) also holds. To this end, note that

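\[ \begin{aligned}
\mathbb P\{\mathcal F(0)\subset\hat{\mathcal F}_n(\delta)\} &= \mathbb P\{\hat{\mathcal F}_n(\delta)=\mathcal F\} \\
&= \mathbb P\{\forall j,\ 1\le j\le N+1:\ P_nf_j\le\min_k P_nf_k+\delta\} \\
&\le \mathbb P\{\forall j,\ 1\le j\le N:\ P_nf_j\le P_nf_{N+1}+\delta\} \\
&= \mathbb P\{\forall j,\ 1\le j\le N:\ \nu_{n,j}\le\nu_n+\delta n\},
\end{aligned} \]

where $\nu_n,\nu_{n,j}$, $1\le j\le N$, are i.i.d. binomial random variables with parameters $n$ and $1/2$. Thus, we get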

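\[ \begin{aligned}
\mathbb P\{\mathcal F(0)\subset\hat{\mathcal F}_n(\delta)\} &\le \sum_{k=0}^{n}\mathbb P\{\nu_n=k\}\,\mathbb P\{\forall j,\ 1\le j\le N:\ \nu_{n,j}\le k+\delta n \mid \nu_n=k\} \\
&= \sum_{k=0}^{n}\mathbb P\{\nu_n=k\}\prod_{j=1}^{N}\mathbb P\{\nu_{n,j}\le k+\delta n\} \\
&= \sum_{k=0}^{n}\mathbb P\{\nu_n=k\}\,\mathbb P^N\{\nu_n\le k+\delta n\} \\
&\le \mathbb P\{\nu_n\ge\bar k\}+\mathbb P^N\{\nu_n\le\bar k+\delta n\},
\end{aligned} \]

where $0\le\bar k\le n$. Let $\bar k=n/2+n\delta$. Then, using Bernstein's inequality, we get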



"{^n > k} < exp 



(log iV)" 



On the other hand, using the normal approximation of the binomial distribution, we get ($\Phi$ denoting the standard normal distribution function)

\[ \mathbb P\{\nu_n\le\bar k+\delta n\}\le\Phi(4\delta\sqrt n)+n^{-1/2}=\Phi(\sqrt{\log N})+n^{-1/2}. \]

Under the condition $N_0\le N\le\sqrt n$ this easily gives (for a large enough $N_0$) $\mathbb P\{\mathcal F(0)\subset\hat{\mathcal F}_n(\delta)\}\le\varepsilon$, which implies the claim. □
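
The splitting step in this proof can be checked numerically. The sketch below computes, for i.i.d. Binomial$(n,1/2)$ counts, the exact value of $\sum_k \mathbb P\{\nu_n=k\}\,\mathbb P^N\{\nu_n\le k+\delta n\}$ and the two-term upper bound obtained by splitting at $\bar k=n/2+n\delta$. The choices $n=400$, $N=20$, $\delta=\frac14\sqrt{\log N/n}$ and all function names are illustrative assumptions, not values from the paper.

```python
import math

def binom_pmf_half(n):
    """Exact pmf of Binomial(n, 1/2), indexed by k = 0..n."""
    total = 2 ** n
    return [math.comb(n, k) / total for k in range(n + 1)]

def split_check(n, N, delta):
    pmf = binom_pmf_half(n)
    cdf, acc = [], 0.0
    for p in pmf:
        acc += p
        cdf.append(min(1.0, acc))  # clamp rounding overshoot

    def F(x):
        # P{nu_n <= x} for a real threshold x
        return 0.0 if x < 0 else cdf[min(n, math.floor(x))]

    # Exact value of sum_k P{nu=k} * P{nu <= k + delta*n}^N
    exact = sum(pmf[k] * F(k + delta * n) ** N for k in range(n + 1))
    # Two-term bound: P{nu >= k_bar} + P{nu <= k_bar + delta*n}^N
    k_bar = n / 2 + n * delta
    tail = sum(pmf[k] for k in range(n + 1) if k >= k_bar)
    split = tail + F(k_bar + delta * n) ** N
    return exact, split

n, N = 400, 20
delta = 0.25 * math.sqrt(math.log(N) / n)
exact, split = split_check(n, N, delta)
```

The value `exact` never exceeds `split`, since terms with $k\ge\bar k$ are bounded by the tail probability and terms with $k<\bar k$ by the $N$-th power of the distribution function at $\bar k+\delta n$.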



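Proof of Lemma 3. First note that, by Theorem 1, the event $\{\mathcal E(\hat f_n)\le\bar\delta_n(t)\}$ holds with probability at least $1-\log_q\frac{q}{\bar\delta_n(t)}\,e^{-t}$. On this event, we have for all $g\in\mathcal F(\varepsilon)$ with $\varepsilon\le\bar\delta_n(t)$

\[ \begin{aligned}
\Bigl|\inf_{\mathcal F}P_nf-\inf_{\mathcal F}Pf\Bigr| &= \Bigl|P_n\hat f_n-\inf_{\mathcal F}Pf\Bigr| \\
&\le P\hat f_n-\inf_{\mathcal F}Pf+|(P_n-P)(\hat f_n-g)|+|(P_n-P)(g)| \\
&\le \bar\delta_n(t)+\sup_{f,g\in\mathcal F(\bar\delta_n(t))}|(P_n-P)(f-g)|+|(P_n-P)(g)|.
\end{aligned} \tag{9.3} \]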



By Talagrand's inequality, with probability at least $1-e^{-t}$,

\[ \sup_{f,g\in\mathcal F(\bar\delta_n(t))}|(P_n-P)(f-g)|\le\bar U_n(\bar\delta_n(t);t)\le q\bar V_n(\bar\delta_n(t);t)\,\bar\delta_n(t)\le\bar\delta_n(t). \tag{9.4} \]

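On the other hand, by Bernstein's inequality, also with probability at least $1-e^{-t}$,

\[ |(P_n-P)(g)|\le\sqrt{2\frac{t}{n}\mathrm{Var}_P\,g}+\frac{t}{n}\le\sqrt{2\frac{t}{n}\Bigl(\inf_{\mathcal F}Pf+\varepsilon\Bigr)}+\frac{t}{n}, \tag{9.5} \]

since $g$ takes values in $[0,1]$, $g\in\mathcal F(\varepsilon)$, and hence $\mathrm{Var}_P\,g\le Pg^2\le Pg\le\inf_{\mathcal F}Pf+\varepsilon$. It follows from (9.3), (9.4) and (9.5) that on some event $E(\varepsilon)$ with probability at least $1-\log_q\frac{q^3}{\bar\delta_n(t)}\,e^{-t}$ the following inequality holds:

\[ \Bigl|\inf_{\mathcal F}P_nf-\inf_{\mathcal F}Pf\Bigr|\le2\bar\delta_n(t)+\sqrt{2\frac{t}{n}\Bigl(\inf_{\mathcal F}Pf+\varepsilon\Bigr)}+\frac{t}{n}. \tag{9.6} \]

Since the events $E(\varepsilon)$ are monotone in $\varepsilon$, one can let $\varepsilon\to0$, which yields the first bound of the lemma.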
To prove the second bound, note that on the same event on which (9.6) with $\varepsilon=0$ holds we also have

\[ \Bigl|\inf_{\mathcal F}P_nf-\inf_{\mathcal F}Pf\Bigr|\le\sqrt{2\frac{t}{n}\Bigl|\inf_{\mathcal F}P_nf-\inf_{\mathcal F}Pf\Bigr|}+2\bar\delta_n(t)+\sqrt{2\frac{t}{n}\inf_{\mathcal F}P_nf}+\frac{t}{n}. \tag{9.7} \]

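We either have

\[ \Bigl|\inf_{\mathcal F}P_nf-\inf_{\mathcal F}Pf\Bigr|\le\frac{8t}{n}\quad\text{or}\quad\frac{2t}{n}\le\frac{\bigl|\inf_{\mathcal F}P_nf-\inf_{\mathcal F}Pf\bigr|}{4}, \]

and in the last case (9.7) implies that

\[ \Bigl|\inf_{\mathcal F}P_nf-\inf_{\mathcal F}Pf\Bigr|\le4\bar\delta_n(t)+2\sqrt{2\frac{t}{n}\inf_{\mathcal F}P_nf}+\frac{2t}{n}. \]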



We can now use the condition of the lemma to replace $\bar\delta_n(t)$ by $\hat\delta_n(t)$ and to get that with probability at least $1-p-\log_q\frac{q^3}{\bar\delta_n(t)}\,e^{-t}$ the following bound holds:

\[ \Bigl|\inf_{\mathcal F}P_nf-\inf_{\mathcal F}Pf\Bigr|\le4\hat\delta_n(t)+2\sqrt{2\frac{t}{n}\inf_{\mathcal F}P_nf}+\frac{8t}{n}. \]

□
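
The case split at the end of this proof is elementary algebra and can be sanity-checked numerically. The sketch below assumes that (9.7) has the form $x\le2\delta+\sqrt{2ax}+\sqrt{2am}+a$ with $x=|\inf P_nf-\inf Pf|$, $a=t/n$ and $m=\inf P_nf$; it computes the largest $x$ compatible with that inequality and compares it with the relaxed bound $4\delta+2\sqrt{2am}+8a$. The variable names and the grid of test values are hypothetical.

```python
import math

def largest_x(delta, a, m):
    """Largest x >= 0 with x <= 2*delta + sqrt(2*a*x) + sqrt(2*a*m) + a."""
    c = 2.0 * delta + math.sqrt(2.0 * a * m) + a
    # At equality, s = sqrt(x) solves s^2 - sqrt(2a)*s - c = 0.
    s = (math.sqrt(2.0 * a) + math.sqrt(2.0 * a + 4.0 * c)) / 2.0
    return s ** 2

def relaxed_bound(delta, a, m):
    """Right-hand side 4*delta + 2*sqrt(2*a*m) + 8*a."""
    return 4.0 * delta + 2.0 * math.sqrt(2.0 * a * m) + 8.0 * a

grid = [(d, a, m)
        for d in (0.0, 0.01, 0.3)
        for a in (1e-4, 0.01, 0.5)
        for m in (0.0, 0.2, 1.0)]
worst_gap = min(relaxed_bound(*p) - largest_x(*p) for p in grid)
```

On every grid point the relaxed bound dominates, so `worst_gap` is nonnegative; the slack comes from the generous constant 8 in the last term.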



Proof of Theorem 5. We will use the following consequence of Theo- 
rem 1 and of Lemma 3 (and its proof) : there exists an event E of probability 
at least 



k=l 



tk 



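such that on the event $E$, for all $k\ge1$,

\[ P\hat f_k-\inf_{f\in\mathcal F_k}Pf\le\bar\delta_n(\mathcal F_k;t_k)\le\hat\delta_n(\mathcal F_k;t_k)\le\tilde\delta_n(\mathcal F_k;t_k) \]

and

\[ \Bigl|\inf_{\mathcal F_k}P_nf-\inf_{\mathcal F_k}Pf\Bigr|\le2\bar\delta_n(\mathcal F_k;t_k)+\sqrt{2\frac{t_k}{n}\inf_{\mathcal F_k}Pf}+\frac{t_k}{n}, \]

\[ \Bigl|\inf_{\mathcal F_k}P_nf-\inf_{\mathcal F_k}Pf\Bigr|\le4\hat\delta_n(\mathcal F_k;t_k)+2\sqrt{2\frac{t_k}{n}\inf_{\mathcal F_k}P_nf}+\frac{8t_k}{n}. \]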

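Note also that the events involved in the proof of Lemma 3 are the same as those involved in the bound of Theorem 1; for this reason, we do not have to add probabilities here. On the event $E$, we have

\[ \begin{aligned}
P\hat f=P\hat f_{\hat k} &\le \inf_{\mathcal F_{\hat k}}Pf+\bar\delta_n(\mathcal F_{\hat k};t_{\hat k}) \\
&\le \inf_{\mathcal F_{\hat k}}P_nf+5\hat\delta_n(\mathcal F_{\hat k};t_{\hat k})+2\sqrt{2\frac{t_{\hat k}}{n}\inf_{\mathcal F_{\hat k}}P_nf}+\frac{8t_{\hat k}}{n} \\
&\le \inf_{\mathcal F_{\hat k}}P_nf+\hat\pi(\hat k)=\inf_k\Bigl[\inf_{\mathcal F_k}P_nf+\hat\pi(k)\Bigr],
\end{aligned} \]

provided that the constant $K$ in the definition of $\hat\pi$ was chosen properly. This proves the first bound of the theorem.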
To prove the second bound, note that since 



:fc infP„/< J l ±MPf + 



n T) 



n T h 



Itk 
n 



MP n f -MPf 



< 



we also have on the event E for all k 
n(k)=K 

K 



n T h 



n 



< 



5 n (Fk;t k ) + J-MPf+ i 



n r h 



n 



■ 7r(k)/2 



and 



infP n /-infP/ <28 n {T k -t k ) + 



n T h 3n 



< 



K 



^infP/ + ^ 

n T k n 



if (*0/2, 



provided that the constant K in the definition of Tr(k) was chosen to be large 
enough. This yields on the event E 



\[ P\hat f\le\inf_k\Bigl[\inf_{\mathcal F_k}P_nf+\hat\pi(k)\Bigr]\le\inf_k\Bigl[\inf_{\mathcal F_k}Pf+\tilde\pi(k)\Bigr], \]

proving the second bound. □



Proof of Lemma 4. We assume, for simplicity, that $Pf$ attains its minimum over $\mathcal G$ at some $\bar f\in\mathcal G$ (the proof can be easily modified if the minimum is not attained). Let $E$ be the event on which the following inequalities hold:

\[ |(P_n-P)(\bar f-f_*)|\le\sqrt{2\frac{t}{n}\mathrm{Var}_P(\bar f-f_*)}+\frac{t}{n} \]

and

\[ \forall f\in\mathcal G:\ \hat{\mathcal E}_n(\mathcal G;f)\le\frac32\bigl(\mathcal E_P(\mathcal G;f)\vee\bar\delta_n(\mathcal G;t)\bigr). \]

The first of these inequalities holds with probability at least $1-e^{-t}$ by Bernstein's inequality; the second holds with probability at least $1-\log_q\frac{q}{\bar\delta_n(\mathcal G;t)}\,e^{-t}$ by (9.2) in the proof of Lemma 2. Hence, $\mathbb P(E)\ge1-\log_q\frac{q^2}{\bar\delta_n(\mathcal G;t)}\,e^{-t}$. We also have $\mathrm{Var}_P^{1/2}(\bar f-f_*)\le\varphi^{-1}(P\bar f-Pf_*)$ and hence, on the event $E$,

\[ |(P-P_n)(\bar f-f_*)|\le\varphi\bigl(\sqrt{\varepsilon}\,\varphi^{-1}(P\bar f-Pf_*)\bigr)+\varphi^*\Bigl(\sqrt{\frac{2t}{\varepsilon n}}\Bigr)+\frac{t}{n}, \]
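implying

\[ P_n(\bar f-f_*)\le(1+\varphi(\sqrt{\varepsilon}))\,P(\bar f-f_*)+\varphi^*\Bigl(\sqrt{\frac{2t}{\varepsilon n}}\Bigr)+\frac{t}{n} \tag{9.8} \]

and

\[ P(\bar f-f_*)\le(1-\varphi(\sqrt{\varepsilon}))^{-1}\Bigl[P_n(\bar f-f_*)+\varphi^*\Bigl(\sqrt{\frac{2t}{\varepsilon n}}\Bigr)+\frac{t}{n}\Bigr]. \tag{9.9} \]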

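Equation (9.8) immediately yields the first bound of the lemma. Since on the event $E$

\[ \begin{aligned}
P_n(\bar f-f_*) &= P_n\bar f-\inf_{\mathcal G}P_nf+\inf_{\mathcal G}P_nf-P_nf_*=\hat{\mathcal E}_n(\mathcal G;\bar f)+\inf_{\mathcal G}P_nf-P_nf_* \\
&\le \inf_{\mathcal G}P_nf-P_nf_*+\frac32\bigl(\mathcal E_P(\mathcal G;\bar f)\vee\bar\delta_n(\mathcal G;t)\bigr),
\end{aligned} \]

and since $\mathcal E_P(\mathcal G;\bar f)=0$, we get

\[ P_n(\bar f-f_*)\le\inf_{\mathcal G}P_nf-P_nf_*+\frac32\bar\delta_n(\mathcal G;t). \]

Along with (9.9), this implies

\[ \inf_{\mathcal G}Pf-Pf_*=P(\bar f-f_*)\le(1-\varphi(\sqrt{\varepsilon}))^{-1}\Bigl[\inf_{\mathcal G}P_nf-P_nf_*+\frac32\bar\delta_n(\mathcal G;t)+\varphi^*\Bigl(\sqrt{\frac{2t}{\varepsilon n}}\Bigr)+\frac{t}{n}\Bigr], \]

which is the second bound of the lemma.

Finally, to prove the third bound, plug into (5.5) the bound on $\bar\delta_n(\mathcal G;t)$ and solve the resulting inequality with respect to $\inf_{\mathcal G}Pf-Pf_*$. □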
Proof of Theorem 6. Let $E_k$ be the event defined in Lemma 4 for $\mathcal G=\mathcal F_k$ and $t=t_k$. Let $E$ be the event on which the following inequalities and the events $E_k$ hold for all $k$:

\[ \mathcal E_P(\mathcal F_k;\hat f_k)=P\hat f_k-\inf_{\mathcal F_k}Pf\le\bar\delta_n(\mathcal F_k;t_k) \]

and $\bar\delta_n(\mathcal F_k;t_k)\le\hat\delta_n(\mathcal F_k;t_k)\le\tilde\delta_n(\mathcal F_k;t_k)$. The first of the inequalities holds with probability at least $1-\log_q\frac{q}{\bar\delta_n(\mathcal F_k;t_k)}\,e^{-t_k}$ either by Theorem 1 or by Theorem 4; the second one holds with probability at least $1-p_k$ by assumptions. Therefore, using Lemma 4,

\[ \mathbb P(E)\ge1-\sum_{k=1}^{\infty}\Bigl(p_k+2\log_q\frac{q^2}{\bar\delta_n(\mathcal F_k;t_k)}\,e^{-t_k}\Bigr). \]

On the event E, using first bound (5.5) and then (5.4) of Lemma 4, we get 



£ P (T- f) = Pf- inf Pf = Pf~ k - Pf* = Pf k ~ mf Pf + inf Pf - Pf, 

■ / 1 ^k 



<6 n fati) + wiPf-Pf, 



<(l-v(Vi))" 



(1 - ^(A/i))4(^;tfc) + mf Pn/ " Pn/* 



+ o <y n(^;*fc)+¥'* 1 

2 fc s V en 




n 



<(l-^)r mf 



inf P n / + (5/2 - tp(y/I))5 n (F k ;t k ) 

■Fk 



en / n 



Pnf* 



(1 - ^(y^r 1 ! inf [inf P n / + 7T(fc)l - P n f 



<i±4#nf 

" 1 - ip(y/e) k 



= infi±4^ 
and the result follows. □ 



+ 



1 + <p(Ve) 




+ 



l + ^(v / i)) n 



infP/-infP/ + 7r(fc) 
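Proof of Theorem 7. Let us define the event $E$ such that on this event, for all $l$ and all $k\le l$,

\[ \inf_{\mathcal F_k}\hat{\mathcal E}_n(\mathcal F_l,f)\le2\Bigl(\inf_{\mathcal F_k}\mathcal E_P(\mathcal F_l,f)\vee\bar\delta_n(\mathcal F_l,t_l)\Bigr), \tag{9.10} \]

\[ \inf_{\mathcal F_k}\mathcal E_P(\mathcal F_l,f)\le2\inf_{\mathcal F_k}\hat{\mathcal E}_n(\mathcal F_l,f)\vee\bar\delta_n(\mathcal F_l,t_l), \tag{9.11} \]

and

\[ \bar\delta_n(\mathcal F_l;t_l)\le\hat\delta_n(\mathcal F_l;t_l)\le\tilde\delta_n(\mathcal F_l;t_l). \tag{9.12} \]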
Then we have 

oo , 2 

F(E)>l-Y,[Pk + log q ^e-^ 
k=i V Ifc 

which is true because of the following reasons. First, for any I, we have with 

2 

probability at least 1 — log^ s ^- ^ e~ tj that for all / G Fj 

t n {T h f) < 2{8 P {T h f) V and £p{T h f) < 2£ n {F u f) V ^(.F,, f,) 

[see the proof of Lemma 2, specifically, (9.1), (9.2)]. Then, by assumptions, 
for all I with probability at least 1—pi, <5 n (F;;tz) < <5 n (F/;t/) < 6 n (^i;ti). It 
remains to use the union bound to get the above lower bound on ¥(E). 

Clearly, on the event E, Vl:8 n (l) < 6 n (l) < 5 n (l). We will show that on 
the same event E, k < k <k <k* . The inequality k < k* is obvious from the 
definitions. If k < k, then there exists I > k such that 

inf 4(^,/) = inf P n f - MP n f > c5 n (l). 

Fk Fl 

We will use that, due to (9.10), on the event E 

inf £ n {T h f) < 2 (inf £ P {T h f) V 5 n (l)) . 

Fk y F k ' 

Therefore (assuming that the constants c, c have been chosen properly) 

inf Pf - inf Pf = w££ P (n,f) > - S n (l) >(%-l) 4(0 > cS n (l), 

which implies that k <k and hence k < k. Similarly, if k < k, then there 
exists / > k such that 

inf £ P (F h f) = inf Pf - inf Pf > ~c~5 n {l). 

J~k J~k J~l 

Due to (9.11), on the event E 

inf £ P (F, , /) < 2 inf £ n (F t , f) V 5 n (l), 
Fk Fk 
implying that

infP n /-infP n / = inf4(^,/) > (cS n (l)-5 n (l))/2 > f^)s n (l) > cS n (l), 

•Fk -Fi F k \ I J 

provided that the constants have been chosen properly. Therefore, k < k and 
hence k <k. 

Next we have on the event E for all k > k 

Pf - inf inf Pf = Pf t - inf Pf + inf Pf - inf inf Pf 

J J F S J Jk F k J F k J j F 3 J 

= Ph - inf Pf + inf Pf - inf Pf + inf Pf - inf inf Pf 

Jk F- k J ^ F h J F k J ^ F k J j F t J 

< 5 n {k) + inf Pf - inf Pf + inf Pf - inf inf Pf 

F k F k F k j Fj 

< 5 n (k) + c5 n ( k) + inf Pf - inf inf Pf 

F k j Fj 

< inf Pf - inf inf Pf + (c + l)S n (k), 

F k j Fj 

implying the first bound. The second bound follows immediately by plugging 
in k = k* (which is possible since k* > k) and observing that infjp fct Pf — 
M j Mr j Pf = 0. □ 

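Proof of Theorem 8. Since $\phi_n(\delta)\le\omega_n(D(\delta))$, conditions (i) and (ii) imply that, for all $P\in\mathcal P_{\rho,\kappa,C}(\mathcal F)$, $\phi_n(\delta)\le Kn^{-1/2}\delta^{(1-\rho)/(2\kappa)}$. Then, by an easy computation,

\[ \bar\delta_n(t)\le K\Bigl[\Bigl(\frac1n\Bigr)^{\kappa/(2\kappa+\rho-1)}\vee\Bigl(\frac tn\Bigr)^{\kappa/(2\kappa-1)}\vee\frac tn\Bigr] \]

with some $K>0$. It remains to recall that $\bar\delta_n(t)\ge\delta_n(t)$ and to use Theorem 1 with $t$ replaced by $t+\log\log_q n$ to get, with some $K>0$, for all $P\in\mathcal P_{\rho,\kappa,C}(\mathcal F)$, the bound

\[ \mathbb P\{n^{\kappa/(2\kappa+\rho-1)}\mathcal E(\hat f_n)\ge K(t+1)\}\le e^{-t}, \]

which implies the result. □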
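
The "easy computation" behind the leading rate can be illustrated by solving a fixed-point equation. The sketch below assumes a local bound of the form $\phi_n(\delta)\le n^{-1/2}\delta^{(1-\rho)/(2\kappa)}$ (constant set to 1 for simplicity), iterates $\delta\mapsto n^{-1/2}\delta^{(1-\rho)/(2\kappa)}$, and compares the limit with $n^{-\kappa/(2\kappa+\rho-1)}$. The function names and parameter values are hypothetical.

```python
def fixed_point_rate(n, kappa, rho, iters=200):
    """Iterate delta <- n^{-1/2} * delta^a with a = (1 - rho) / (2 * kappa)."""
    a = (1.0 - rho) / (2.0 * kappa)
    delta = 1.0
    for _ in range(iters):
        # In log scale this map is a contraction with factor a < 1.
        delta = n ** -0.5 * delta ** a
    return delta

def closed_form_rate(n, kappa, rho):
    """Solution of delta = n^{-1/2} * delta^{(1-rho)/(2*kappa)}."""
    return n ** (-kappa / (2.0 * kappa + rho - 1.0))

numeric = fixed_point_rate(10_000, kappa=1.0, rho=0.5)
exact_rate = closed_form_rate(10_000, kappa=1.0, rho=0.5)
```

The two values agree up to floating-point error, which matches the exponent $\kappa/(2\kappa+\rho-1)$ appearing in the normalization of the excess risk.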

Proof of Theorem 9. We use Theorem 7 to get for all P 
nPf-Pf* > Kd n (k*(P))} = 0(n- 2 ). 
Since for all P £Vj, k*(P) = j, we have 

max sup F{Pf - Pf, > K~5 n (j)} = 0(n~ 2 ). 

l<j<N Pe p. 
The same argument as in the proof of Theorem 8 shows that 8 n (j) < Kn &
Therefore 

max sup n^'E(P/ - P/*) < max n& sup P{P/ - P/* > Kn~^} + K 

l<j<N P&v l<j<N P&v 



<K + 0( max n^~ 2 ) = Oil). 



□ 



Proof of Theorem 10. We first look at a single class T of binary 
functions. The following upper bounds hold: 



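\[ D^2(\mathcal F;\delta)=\sup_{f,g\in\mathcal F(\delta)}P(f-g)^2\le\sup_{f,g\in\mathcal F(\delta)}(Pf+Pg)\le2\Bigl(\inf_{\mathcal F}Pf+\delta\Bigr) \]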



and 

(9.13) u n {T;8)<K 



/ Elog A JF (Xi, . . . ,X n ) | Elog A" r (Xi, . . . ,X n ) 



n n 
where the proof of the second bound can be found in [36]. It follows that 



<t>n{S)<K 



l2[MPf + 6 



E log A^ (Xi,...,X n ) , Elog A jr (Xi, . . .,X n 



+ ■ 



n n 
which implies, by using the Jj-transform, that with some constant K 



6 n (t) < K 
We now define 

L(t) :=K 
and 

S n {t) :=K 



. nf pf mogA^(X 1 ,...,X n ) + t + E\ogA :F (X 1 ,...,X n ) + t 
feF n n 



. nf logA^(X 1 ,...,X n ) + t + logA^(X 1 ,...,X n )+t 

f&F n n n 



. nf ElogA jr (Xi,...,X ra ) + t + ElogA^(X 1 ,...,X n ) + t 

f^T n n 



We use the following deviation inequality for shattering numbers due to Boucheron, Lugosi and Massart [12]: with probability at least $1-e^{-t}$,

\[ \log\Delta^{\mathcal F}(X_1,\dots,X_n)\le2\,\mathbb E\log\Delta^{\mathcal F}(X_1,\dots,X_n)+2t \]

and

\[ \mathbb E\log\Delta^{\mathcal F}(X_1,\dots,X_n)\le2\log\Delta^{\mathcal F}(X_1,\dots,X_n)+2t. \]



Using this device together with Lemma 3, it is easy to see that with proba- 
bility at least 1 — log g s -f L e~ t we have S n (t) < S n (t) < S n (t). For instance, to 
prove the first of the two inequalities, note that, by the above deviation inequality for shattering numbers, on an event of probability at least $1-e^{-t}$ we can replace, in the bound on $\bar\delta_n(t)$, $\mathbb E\log\Delta^{\mathcal F}(X_1,\dots,X_n)$ by $\log\Delta^{\mathcal F}(X_1,\dots,X_n)$.
On the other hand, the first bound of Lemma 3 implies that with probability 
at least 1 — log q j^ye~* we have (using 2ab < a 2 + b 2 ) 



inf P/ < inf P n f + 25 n (t) + 2 J- inf Pf/2 + 1- 
T T y n T in 

2t 

< inf P n f + 2S n {t) + inf Pf/2 + -, 

which implies inf^-P/ < 2infjrP n / + 4S n (t) + At/n. Plugging this into the 
bound on 5 n (t) and replacing Elog A :r (Xi, . . . ,X n ) by log A :F {X\, . . . ,X n ), 
we easily get (with some constant K) 



5n(t)<K 



/ inf p \ogA^(X 1 ,...,X n ) + t | \ogA :F (X 1 ,...,X n )+t 

f&T n n n 



+ 2\ 



'*„(*) K 2 logAF(X 1 ,...,X n )+t 



2 2n 

which, again using 2ab < a 2 + b 2 , leads to the following bound (with some 
K): 

S n (t)<K 



I M p f logA^(X u ...,X n ) + t | \ogAF(X 1 ,...,X n )+t 



= 6n(t), 

which holds with probability at least 1 — log„ T L r^e~ t ■ The second inequal- 

ity $n(t) < 5 n (t) can be proved similarly. For a sequence .Pfc of classes of 
binary functions, this gives condition (5.2) and allows us to use Theorem 5 
to complete the proof. □ 
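
To make the shattering number $\Delta^{\mathcal F}(X_1,\dots,X_n)$ used in this proof concrete, here is a minimal Python sketch that counts the distinct labelings of a sample produced by half-line indicators $x\mapsto 1\{x\le s\}$. The class of half-lines, the helper name `shatter_count_halflines` and the sample values are illustrative assumptions, not taken from the paper.

```python
def shatter_count_halflines(points):
    """Number of distinct 0/1 labelings of `points` by the sets (-inf, s]."""
    xs = sorted(set(points))
    # One threshold below every point, plus one threshold at each distinct point,
    # is enough to realize every labeling this class can produce.
    thresholds = [xs[0] - 1.0] + xs
    labelings = {tuple(1 if x <= s else 0 for x in points) for s in thresholds}
    return len(labelings)

sample = [0.7, -1.2, 3.4, 0.1, 2.2]
count = shatter_count_halflines(sample)
```

For $n$ distinct points this class produces $n+1$ labelings, so its log-shattering number grows like $\log n$ and is directly computable from the data, which is what makes data-dependent bounds of this type usable.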

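Proof of Lemma 5. First note that

\[ \phi_n(\delta)=\mathbb E\sup_{f,g\in\mathcal F(\delta)}|(P_n-P)(f-g)|\le2\,\mathbb E\sup_{f\in\mathcal F(\delta)}|(P_n-P)(f-\bar f)|. \]

Also, $f\in\mathcal F(\delta)$ implies that

\[ \begin{aligned}
\rho_P(f,\bar f) &\le \rho_P(f,f_*)+\rho_P(\bar f,f_*)\le\sqrt{D(Pf-Pf_*)}+\sqrt{D(P\bar f-Pf_*)} \\
&\le \sqrt{D(Pf-P\bar f)}+2\sqrt{D(P\bar f-Pf_*)} \\
&\le \sqrt{D\delta}+2\sqrt{D\Delta}\le\sqrt{2D(\delta+4\Delta)},
\end{aligned} \]

where $\Delta:=P\bar f-Pf_*=\inf_{\mathcal F}Pf-Pf_*$. It follows that

\[ D(\mathcal F;\delta)\le2\sqrt D\bigl(\sqrt\delta+2\sqrt\Delta\bigr)\quad\text{and}\quad\phi_n(\delta)\le2\theta_n\bigl(\sqrt{2D(\delta+4\Delta)}\bigr). \]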

As a consequence, recalling the definition of U n (5; t), we easily get with some 
constant C > for all e G (0, 1] 



U n (5;t) < C9 n U 2D(5 + AA)) + C\ + C(sA + —\ 

v \l n \ ne / 



where we used the inequality $2\sqrt{D\Delta\frac{t}{n}}\le\varepsilon\Delta+\frac{Dt}{n\varepsilon}$ to bound the term $D(\mathcal F;\delta)\sqrt{t/n}$ involved in $U_n(\delta;t)$. Since



K 2q 3 J " n '*V2g 3 

it is enough now to bound the {(-transform of 1^1,1(^2^3 separately and to 
use property 4 of Section 2.3. Let u := ^3. Then, by properties 3, 7, 8 of 
Section 2.3 

Also, (see property 6 with a = 1/2 and property 3) ip\{ u ) — C 2 Dt/{nu 2 ) and 
(property 5) 

, . C/ Dt\ 
1P3 (u )< — eA+ — . 
u V ne J 

As a result, property 4 now yields 

which after proper rescaling of e and adjusting the constants gives the bound 
of the lemma. □ 



Proof of Theorem 11. It is a straightforward consequence of Theorem 6, Remarks 2 and 4 after this theorem and Lemma 5. Note that one should choose $\varphi_k(u)=u^2/D_k$, which implies that $\varphi_k^*(v)=D_kv^2/4$. The rest is an easy computation. □
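
The conjugate used in this proof, $\varphi^*(v)=\sup_{u\ge0}(uv-\varphi(u))$, can be checked numerically for the quadratic choice $\varphi(u)=u^2/D$. The sketch below approximates the supremum on a grid and compares it with the closed form $Dv^2/4$. The values of $D$ and $v$, the grid parameters and the function name are arbitrary illustrative assumptions.

```python
def conjugate_numeric(phi, v, u_max=50.0, steps=20_000):
    """Grid approximation of sup over 0 <= u <= u_max of (u*v - phi(u))."""
    best = 0.0  # value at u = 0 when phi(0) = 0
    for i in range(steps + 1):
        u = u_max * i / steps
        best = max(best, u * v - phi(u))
    return best

D_k = 3.0
v = 1.7
approx = conjugate_numeric(lambda u: u * u / D_k, v)
closed_form = D_k * v * v / 4.0
```

The supremum is attained at $u=D_kv/2$, which lies inside the grid here, so `approx` and `closed_form` coincide up to discretization error.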

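Proof of Lemma 6. First of all, note that by the Lipschitz condition (7.3)

\[ P|(\ell\bullet g_1)-(\ell\bullet g_2)|^2\le L^2\|g_1-g_2\|^2_{L_2(\Pi)}. \]

Next, by (7.5), we have for $g\in\mathcal G$, $x\in S$, $y\in T$

\[ \frac{\ell(y,g(x))+\ell(y,\bar g(x))}{2}\ge\ell\Bigl(y,\frac{g(x)+\bar g(x)}{2}\Bigr)+\psi(|g(x)-\bar g(x)|^r). \]

Integrating this inequality and observing that $\frac{g+\bar g}{2}\in\mathcal G$, and hence $P\bigl(\ell\bullet\frac{g+\bar g}{2}\bigr)\ge P(\ell\bullet\bar g)$, yields

\[ \frac{P(\ell\bullet g)+P(\ell\bullet\bar g)}{2}\ge P(\ell\bullet\bar g)+\Pi\psi(|g-\bar g|^r), \]

or

\[ P(\ell\bullet g)-P(\ell\bullet\bar g)\ge2\Pi\psi(|g-\bar g|^r). \]

Now we can use Jensen's inequality, the monotonicity of $\psi$, and the fact that $|g-\bar g|\le M$ to get

\[ \mathcal E_P(\mathcal F;\ell\bullet g)=P(\ell\bullet g)-P(\ell\bullet\bar g)\ge2\psi(\Pi|g-\bar g|^r)\ge2\psi\bigl(M^{r-2}\|g-\bar g\|^2_{L_2(\Pi)}\bigr), \]

which implies

\[ \mathcal F(\delta)=\{\ell\bullet g:\ g\in\mathcal G,\ \mathcal E_P(\mathcal F;\ell\bullet g)\le\delta\}\subset\{\ell\bullet g:\ g\in\mathcal G_\delta\}, \]

where $\mathcal G_\delta:=\{g\in\mathcal G:\ \|g-\bar g\|^2_{L_2(\Pi)}\le M^{2-r}\psi^{-1}(\delta/2)\}$. Therefore

\[ D_P(\mathcal F;\delta)\le L\sup_{g_1,g_2\in\mathcal G_\delta}\|g_1-g_2\|_{L_2(\Pi)}\le2LM^{1-r/2}\sqrt{\psi^{-1}(\delta/2)}. \]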

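We will now bound $\phi_n(\delta)=\phi_n(\mathcal F;\delta)$ in terms of $\theta_n(\delta)=\theta_n(\mathcal G;\bar g;\delta)$. By the symmetrization inequality,

\[ \begin{aligned}
\phi_n(\delta) &= \mathbb E\sup_{f_1,f_2\in\mathcal F(\delta)}|(P_n-P)(f_1-f_2)| \\
&\le 2\,\mathbb E\sup_{g_1,g_2\in\mathcal G(\delta)}\Bigl|n^{-1}\sum_{i=1}^{n}\varepsilon_i\bigl(\ell(Y_i;g_1(X_i))-\ell(Y_i;g_2(X_i))\bigr)\Bigr| \\
&\le 4\,\mathbb E\sup_{g\in\mathcal G(\delta)}\Bigl|n^{-1}\sum_{i=1}^{n}\varepsilon_i\bigl(\ell(Y_i;g(X_i))-\ell(Y_i;\bar g(X_i))\bigr)\Bigr|,
\end{aligned} \]

which by the contraction inequality can be bounded further by

\[ \begin{aligned}
&16L\,\mathbb E\sup_{g\in\mathcal G(\delta)}\Bigl|n^{-1}\sum_{i=1}^{n}\varepsilon_i\bigl(g(X_i)-\bar g(X_i)\bigr)\Bigr| \\
&\quad\le16L\,\mathbb E\sup\Bigl\{\Bigl|n^{-1}\sum_{i=1}^{n}\varepsilon_i\bigl(g(X_i)-\bar g(X_i)\bigr)\Bigr|:\ g\in\mathcal G,\ \|g-\bar g\|^2_{L_2(\Pi)}\le M^{2-r}\psi^{-1}(\delta/2)\Bigr\}.
\end{aligned} \]

Using now the desymmetrization inequality yields

\[ \phi_n(\delta)\le32L\,\mathbb E\sup\bigl\{|(\Pi_n-\Pi)(g-\bar g)|:\ g\in\mathcal G,\ \|g-\bar g\|^2_{L_2(\Pi)}\le M^{2-r}\psi^{-1}(\delta/2)\bigr\}+8L\sqrt{\frac{M^{2-r}\psi^{-1}(\delta/2)}{n}}. \]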



As a result, we can bound (with a proper choice of $C$)

\[ U_n(\delta;t)\le W_n(\delta;t)=C\Bigl[L\theta_n\bigl(M^{2-r}\psi^{-1}(\delta/2)\bigr)+L\sqrt{\frac{M^{2-r}\psi^{-1}(\delta/2)(t+1)}{n}}+\frac{t}{n}\Bigr], \]

and the first bound follows. The second bound is also immediate because of property 2, Section 2.3. □

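Proof of Theorem 12. We will apply the lemma with $r=2$ and $\psi(u)=\Lambda u$. Suppose that $\theta_n$ is upper bounded by a function $\breve\theta_n$ of strictly concave type. In this case we have

\[ W_n(\delta;t)\le C\Bigl[L\breve\theta_n\Bigl(\frac{\delta}{2\Lambda}\Bigr)+L\sqrt{\frac{\delta(t+1)}{2\Lambda n}}+\frac{t}{n}\Bigr]. \]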



Using the basic properties of the $\sharp$-transform, it is easy to deduce that with some constant $C$

L 2 t + l 



s?(S;t)<c 



2A9i 



+ 



A n 



Since $\mathcal G:=M\,\mathrm{conv}(\mathcal H)$, where $\mathcal H$ is a VC-type class of functions from $S$ into $[-1/2,1/2]$, condition (2.1) holds for $\mathcal H$ with envelope $F\equiv1$. As in Example 4 of Section 2,



0n(S)<e n (5) :=C 



£(i-p)/2 v . 



with p := yq^j. Such a 9 n is of strictly concave type and 6*1 (e) < C A ^y (1 
£ -2/(i+p) for e < 1. Therefore, 



M 2p/( l+p) 



S?(G;t)<C 



AM V ' { - V+ 



(L ,( V +2)/(V+l) _ 1Y± , 

VA Vl j n 2V+1 



+ 



LH + l 
A n 



= vr n (M,L,A;t). 

Assume now that for all $y$, $\ell(y,\cdot)$ is bounded by 1 on the interval $[-M/2,M/2]$. Applying Theorem 2, we get

\[ \mathbb P\Bigl\{P(\ell\bullet\hat g)\ge\min_{g\in\mathcal G}P(\ell\bullet g)+\pi_n(M,L,\Lambda;t)\Bigr\}\le e^{-t}. \]

To get rid of the assumption that $\ell$ is bounded by 1, note that if $\ell$ is bounded by $D$ on the interval $[-M/2,M/2]$, one can replace $\ell$ by $\ell/D$, and that $L,\Lambda$ then become $L/D,\Lambda/D$. Since $\pi_n(M,L/D,\Lambda/D;t)=\pi_n(M,L,\Lambda;t)/D$, the result follows by a simple rescaling. □